Solr ClassificationUpdateProcessorFactory Bayes: problem with labels - solr

I have encountered some very strange behaviour. To test the classification function in Solr, I have defined the following processor chain:
<updateRequestProcessorChain name="classification">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.ClassificationUpdateProcessorFactory">
<str name="inputFields">content</str>
<str name="classField">cat_knn</str>
<str name="predictedClassField.maxCount">2</str>
<str name="algorithm">knn</str>
<str name="knn.k">10</str>
<str name="knn.minTf">1</str>
<str name="knn.minDf">1</str>
</processor>
<processor class="solr.ClassificationUpdateProcessorFactory">
<str name="inputFields">content</str>
<str name="classField">cat_bayes</str>
<str name="predictedClassField.maxCount">2</str>
<str name="algorithm">bayes</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
As a test set I am using news categories, such as "business", "entertainment" etc.
The relevant fields are defined as follows:
<field name="cat_knn" type="text_en" indexed="true" stored="true" multiValued="true"/>
<field name="cat_bayes" type="text_en" indexed="true" stored="true" multiValued="true"/>
For the training set, cat_knn and cat_bayes contain exactly the same category labels.
However, when I use the above chain to classify new documents, cat_knn is populated with the full label, i.e. "business" or "entertainment", whereas with the bayes algorithm the labels are cut off and stored as "busi" or "entertain". At the same time, a label like "sport" is properly recorded as "sport".
Any idea what might be going on here?

What you are seeing are the stemmed tokens for the field rather than its stored values. The SolrClassification wiki page says this about the class field:
The field that contains the class of the document. It must appear in the indexed documents. If knn algorithm it must be stored. If bayes algorithm it must be indexed and ideally not heavily analysed.
This indicates that bayes uses the indexed tokens, while knn uses the stored text of the field, when outputting the class. With a stemming field type (which text_en typically is), "business" is indexed as "busi", while "sport" is already its own stem and comes through unchanged.
Change the field type to string or strings (single-valued vs. multivalued), or to a text field with minimal analysis (for example, a KeywordTokenizer with only a lowercase filter).
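As a rough sketch of the "minimal analysis" option, the schema could define a field type that keeps each label as a single lowercased token. The type name label_lower is made up for this example; the tokenizer and filter factories are the standard Solr ones, but whether this suits your data is an assumption you should verify against your own index.

```xml
<!-- Hypothetical field type: the whole label stays one token, only lowercased -->
<fieldType name="label_lower" class="solr.TextField">
  <analyzer>
    <!-- KeywordTokenizer emits the entire input as a single token (no splitting, no stemming) -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The bayes class field from the question, switched to the lightly analysed type -->
<field name="cat_bayes" type="label_lower" indexed="true" stored="true" multiValued="true"/>
```

After changing the field type, the training documents would need to be reindexed so that the indexed tokens reflect the new analysis.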


Solr server Context Filtering in Auto suggester not working

I'm having a problem when I try to use Context Filtering with the auto suggester. What I want is to filter the suggestions based on the url field.
Here is my searchComponent:
<lst name="suggester">
<str name="name">AnalyzingInfixSuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">main_title</str>
<str name="weightField">main_title</str>
<str name="contextField">url</str>
<str name="suggestAnalyzerFieldType">text_general</str>
</lst>
Here are the fields in my schema:
<field name="main_title" type="string" indexed="true" stored="true"/>
<field name="url" type="string" indexed="true" stored="true"/>
Example:
I'm searching for "aacsb" and I get two results, which is correct. One is in English and one is in German. I want to filter out the English one and show only the German result.
My URLs look like this:
https://www.myWebsite.com/aacsb-dog-lion?german
https://www.myWebsite.com/aacsb-dog-lion?english
Here are my queries:
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=-url:english
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=-english
With these I'm receiving both results. It doesn't matter whether the field name is included or not.
When I tried these:
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=url:english
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=english
I don't receive any results.
I read the documentation several times (LINK), but I still can't make it work.
Any help is welcomed.
Thanks!
EDIT:
I pasted the wrong queries; these are the correct ones:
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=url:*english*
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=*english*

How to configure multiple contextfields in single solr suggester?

I am using Apache Solr to search records in my current application.
I was able to filter the suggestions based on documentType by configuring the context field.
Now I want to add another context field, such as departmentType. I am not sure how to configure the suggester for multiple context fields.
This is the suggester I use with a single context field, and it is working fine.
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggesterByName</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">fullName</str>
<str name="contextField">documentType</str>
<str name="suggestAnalyzerFieldType">text_general</str>
<str name="buildOnStartup">false</str>
</lst>
</searchComponent>
I referred to this post:
https://issues.apache.org/jira/browse/SOLR-7888
but it is still not clear to me how to configure multiple context fields in a single suggester.

You have to create a new field in your schema.xml, for example context_field.
This field should have multiValued="true":
<field name="context_field" type="text_suggest" multiValued="true" indexed="true" stored="true"/>
Then, when indexing into Solr, populate context_field as a list in the JSON document:
"context_field" : ["some document type", "some department type"]
After indexing, you can request suggestions like this:
suggest.q=b&suggest.cfq=context_documentType AND context_departmentType
Hope it works
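As a sketch of how the pieces might fit together in the configuration: instead of building the list on the client, the two source fields could be copied into context_field with copyField, and the suggester pointed at that combined field. The copyField approach is an assumption on my part and not part of the answer above; the field and suggester names are taken from the question.

```xml
<!-- schema.xml: combined context field from the answer, filled from both source fields (assumed approach) -->
<field name="context_field" type="text_suggest" multiValued="true" indexed="true" stored="true"/>
<copyField source="documentType" dest="context_field"/>
<copyField source="departmentType" dest="context_field"/>

<!-- solrconfig.xml: same suggester as in the question, with contextField switched to the combined field -->
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">suggesterByName</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">fullName</str>
    <str name="contextField">context_field</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
```

The suggester would have to be rebuilt after reindexing for the new contexts to take effect.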

Is there a multiValue field sort workaround in solr

I am looking for alternative ways to sort on a multivalued field.
I know that this question has been asked before and that the solutions talk about min and max, but that is not the strategy I am looking for.
Is there a way to copy the multivalued field over to another field that can be used for sorting?
For example, like this:
<field name="cat" type="string" indexed="true" stored="true"
multiValued="true"/>
<copyField source="cat" dest="firstcat"/>
<field name="firstcat" type="string" indexed="true" stored="false"
multiValued="false"/>
Answering my own question.
The copyField above will not work: it throws an exception as soon as the multivalued source holds more than one value, because the destination is single-valued. I mean, duh. Obviously.
One working solution is to define an updateRequestProcessorChain in solrconfig.xml and attach it to the update handler.
Here is a sample:
<updateRequestProcessorChain name="concatFields">
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">str1</str>
<str name="dest">str2</str>
</processor>
<processor class="solr.ConcatFieldUpdateProcessorFactory">
<str name="fieldName">str2</str>
<str name="delimiter">_</str>
</processor>
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">str2</str>
<str name="dest">str3</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
And then chain the processor to the path:
<initParams path="/update/**">
<lst name="defaults">
<str name="update.chain">concatFields</str>
</lst>
</initParams>
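Applied to the cat / firstcat fields from the question, a variant of the same idea might look like the sketch below. It assumes that keeping only the first value is an acceptable sort key (concatenation, as in the sample above, is the alternative), that the copyField from the question has been removed from the schema, and the chain name firstValueForSort is made up for this example.

```xml
<updateRequestProcessorChain name="firstValueForSort">
  <!-- Copy every value of the multivalued source into the sort field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">cat</str>
    <str name="dest">firstcat</str>
  </processor>
  <!-- Drop all but the first value so the single-valued field accepts the document -->
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">firstcat</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain would be wired to the update path in the same way as above, with update.chain set to firstValueForSort.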

Keep one entry of duplicate articles with SOLR deduplication

I have used Solr deduplication with the following settings in solrconfig.xml:
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">signature</str>
<bool name="overwriteDupes">true</bool>
<str name="fields">description</str>
<str name="signatureClass">solr.processor.TextProfileSignature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
and in schema.xml:
<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />
My objective is to find documents with duplicate descriptions (I used TextProfileSignature for near-duplicate detection), keep one entry, and remove the other duplicate entries.
For example:
doc1
description : Websol – Candidate should be good in communication and computer skills
must be willing to relocate
We have good vacancies for Back Office in international call centers
doc2
description :Websol – Candidate should be good in communication and computer skills
must be willing to relocate
We have good vacancies for Back Office in international call centers...
Of these two docs, only one should be deleted, not both, but with Solr dedupe both entries get deleted.
Let me know if I am missing anything in the settings or if I need to take another approach to achieve this.
It could be that you are suffering from a known issue.

Mapping fields in SOLR for faceting

I'm indexing rich text documents into SOLR 3.4 using ExtractingRequestHandler and I'm having trouble getting it to behave like I want it to.
I would like to store creation date as a field to use for faceted search later and have defined the following in schema.xml:
<field name="creation_date" type="date" indexed="true" stored="true"/>
I index like this:
curl -s "http://localhost:8983/solr/update/extract?literal.id=myid&resource.name=myfile.xls&commit=true" -F myfile=@/path/to/myfile.xls
I get the dynamic field attr_creation_date (other rules ensure this), but I don't get it as creation_date. I have also tried, unsuccessfully, to use copyField like so:
<copyField source="attr_creation_date" dest="creation_date"/>
Yet another try was putting this in solrconfig.xml, but no luck:
<str name="fmap.Creation-Date">creation_date</str>
I'm pretty sure I'm missing something basic here. Any help is most appreciated!
Settings for ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="fmap.Last-Save-Date">last_save_date</str>
<str name="fmap.Creation-Date">creation_date</str>
<str name="fmap.Content-Type">content_type</str>
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
</lst>
</requestHandler>
My schema.xml file (lots of default stuff): https://gist.github.com/1358002
