Solr autocomplete for multi-term phrases

I have a question about autocomplete in Solr. Say there is a multi-word string, "nice cheap laptop", which should be suggested to users when they type 'nice', 'cheap' or 'laptop'. How can this be achieved with Solr?
I'm trying to migrate to Solr some code that currently works with Elasticsearch. For ES, the mapping uses the 'completion' type, for which I configure all permutations of the terms in the phrase as inputs to match against, with the original phrase as the output. I couldn't find in the docs whether or how this is possible with Solr.
EDIT:
I tried adding the following to solrconfig.xml:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">name</str>
<!--str name="weightField">price</str-->
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">false</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler"
startup="lazy" >
<lst name="defaults">
<str name="suggest.dictionary">mySuggester</str>
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
And the following to managed schema:
<field name="productNameId" type="string" indexed="true" stored="true"/>
<field name="aspectId" type="pint" indexed="true" stored="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="categoryId" type="string" indexed="true" stored="true"/>
Then indexed 3 documents with solrj:
String urlString = "http://localhost:8983/solr/aspects";
HttpSolrClient client = new HttpSolrClient.Builder(urlString).build();
client.setParser(new XMLResponseParser() );
ProductAspects pa1 = new ProductAspects();
pa1.setId("1");
pa1.setAspectId(1);
pa1.setName("alice");
ProductAspects pa2 = new ProductAspects();
pa2.setId("2");
pa2.setAspectId(2);
pa2.setName("alza");
ProductAspects pa3 = new ProductAspects();
pa3.setId("3");
pa3.setAspectId(3);
pa3.setName("alza bob");
final UpdateResponse res1 = client.addBean( pa1 );
final UpdateResponse res2 = client.addBean( pa2 );
final UpdateResponse res3 = client.addBean( pa3 );
UpdateResponse res = client.commit();
After that, I expected a query for 'alz' to return just 2 docs, but it returns all 3:
http://localhost:8983/solr/aspects/suggest?suggest.dictionary=mySuggester&suggest=true&suggest.build=true&suggest.q=alz
What is the correct configuration for autocomplete with Solr?
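One possible direction, sketched under assumptions rather than taken from the question: FuzzyLookupFactory tolerates one edit by default, which would explain why 'alz' also matches 'alice', and a `string` analyzer keeps the whole phrase as a single token, so only the start of the phrase can match. A suggester built on AnalyzingInfixLookupFactory with a tokenizing field type matches a prefix of any word in the phrase instead. The suggester name `infixSuggester` and the `indexPath` value below are made up for illustration; `text_general` is the type already used for `name`.

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <!-- Hypothetical name, chosen for this sketch -->
    <str name="name">infixSuggester</str>
    <!-- Matches a prefix of any token, not only the start of the phrase -->
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <!-- The source field must be stored, as "name" is in the schema above -->
    <str name="field">name</str>
    <!-- Tokenizing type so that each word of the phrase becomes a term -->
    <str name="suggestAnalyzerFieldType">text_general</str>
    <!-- Assumed directory name for the suggester's own index -->
    <str name="indexPath">infix_suggestions</str>
    <!-- Return plain text instead of <b>-highlighted matches -->
    <str name="highlight">false</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
```

With this in place, a request such as /suggest?suggest.dictionary=infixSuggester&suggest.build=true&suggest.q=cheap should be able to return "nice cheap laptop". The exact behavior depends on the Solr version, so treat this as a starting point.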

Related

Solr 7.x "/export" response handler not working with Streaming Expressions

I am doing a Solr streaming expression, and I try to use the /export handler to fetch all of the results. Consider the following query:
search(main, q=*:*, fl="SSRN",qt="/export",sort="SSRN asc")
I configured my schema.xml for the SSRN field as follows:
<field name="SSRN" type="int" indexed="true" stored="true" required="false" multiValued="false" docValues="true" />
Since the SSRN field has docValues enabled, /export should work, but the result contains only the default 10 documents. This is running in a SolrCloud environment with just one node and one shard.
Thanks in advance!
I fixed the issue. It seems that SOLR-8426 ("Enable /export, /stream and /sql handlers by default and remove them from example configs") removed the need to declare the /export handler in solrconfig.xml, and if you do declare it, it no longer works. The solution is simply to remove this block from solrconfig.xml:
<requestHandler name="/export" class="solr.SearchHandler">
<lst name="invariants">
<str name="rq">{!xport}</str>
<str name="wt">xsort</str>
<str name="distrib">false</str>
</lst>
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler>

Solr autocomplete on words not in beginning of string

I have some docs in Solr that contains information about books. One of the fields is author, defined as:
<field name="author" type="text_general" indexed="true" stored="true"/>
This is an example of a doc:
<doc>
<str name="id">db04</str>
<str name="isbn">0596529325</str>
<str name="author">Toby Segaran</str>
<str name="category">Computers/Programming/Information Retrieval/Machine Learning</str>
<arr name="title">
<str>Programming Collective Intelligence</str>
</arr>
<int name="yearpub">2007</int>
<date name="pubdate">2007-07-28T00:00:01Z</date>
</doc>
I'm trying to create an autocomplete system using Solr 4.2. So far it works well: if I search for "to", it returns Toby Segaran as the result.
But on our website many people search for "Segaran", for instance, and I was wondering whether it is possible to suggest Toby Segaran when this happens.
So far this is the schema.xml I'm using:
<field name="author_suggest" type="text_auto" indexed="true" stored="true" multiValued="false"/>
<copyField source="author" dest="author_suggest"/>
<fieldType class="solr.TextField" name="text_auto">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Basically, the author field is copied to author_suggest, which is analyzed as text_auto.
In solrconfig.xml, these were created:
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookupFactory</str>
<str name="field">author_suggest</str>
<float name="threshold">0.005</float>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.count">6</str>
<str name="spellcheck.collate">false</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
So, is it possible to make suggestions based on words that are not at the beginning of the phrase using the Solr suggester?
If you need more information, please let me know.
Thanks in advance

integrating solr autosuggest functionality error

I am trying to integrate Solr's autosuggest functionality into my project. I used this as my starting point and changed the searched fields accordingly.
My schema.xml:
<field name="name" type="text_suggest" indexed="true" stored="true"/>
<field name="manu" type="text_suggest" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
<!-- A variant of textsuggest which only matches from the very left edge -->
<copyField source="name" dest="textnge"/>
<field name="textnge" type="autocomplete_edge" indexed="true" stored="false" />
<!-- A variant of name which matches from the left edge of all terms (implicit truncation) -->
<copyField source="name" dest="textng"/>
<field name="textng" type="autocomplete_ngram" indexed="true" stored="false" omitNorms="true" omitTermFreqAndPositions="true" />
My request handler in solrconfig.xml
<requestHandler class="solr.SearchHandler" name="/ac" default="true" >
<lst name="defaults">
<str name="defType">edismax</str>
<str name="rows">10</str>
<str name="fl">*,score</str>
<str name="qf">name^50 manu^20.0 textng^50.0</str>
<str name="pf">textnge^50.0</str>
<str name="bf">product(log(sum(popularity,1)),100)^20</str>
<str name="debugQuery">false</str>
</lst>
</requestHandler>
The problem is that my "/ac" handler behaves more like the "/select" handler. When I type "moni" I get nothing, but when I type "monitor" it returns the documents containing "monitor".
I have been trying this for a whole day and nothing seems to work. Any help will be deeply appreciated.
When you query for "moni", you are asking for the exact keyword "moni". Try a wildcard (multiterm) query by appending "*", such as q=moni*.
You can also query the fields that use the other analyzers directly, such as autocomplete_edge (q=textnge:moni) or autocomplete_ngram (q=textng:moni), to see what they match.
I think you need to specify a search component in solrconfig.xml, like below:
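The definitions of autocomplete_edge and autocomplete_ngram are not shown in the question. As an assumption about what such types usually look like (not the poster's actual schema), a minimal sketch using Solr's stock analysis factories might be:

```xml
<!-- Hypothetical definitions; the gram sizes are illustrative -->
<!-- Matches only from the left edge of the whole value -->
<fieldType name="autocomplete_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- Matches from the left edge of every term in the value -->
<fieldType name="autocomplete_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With index-time edge n-grams like these, "moni" is stored as a term for "monitor", so a plain q=textng:moni can match without a wildcard. If a prefix returns nothing, it is worth checking that the types exist in the schema and that the documents were reindexed after the fields were added.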
<searchComponent class="solr.SpellCheckComponent" name="ac">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookupFactory</str>
<str name="field">yourfieldname</str> <!-- the indexed field to derive suggestions from -->
<float name="threshold">0.005</float>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>

textual content without metadata from Tika via SolrCell

Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text content of the document.
I would like to provide some snippet highlighting of the content and the subject metadata within the content field is skewing the highlight results.
UPDATE: Screenshot of Tika output as indexed by Solr. Highlighted portion is the block of metadata that gets prepended as a block of text to the PDF content.
The ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
</lst>
</requestHandler>
Schema.xml fields. Note "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when a doc is posted to the handler.
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
As all other answers are completely irrelevant, I'll post mine:
I experienced exactly the same problem the OP describes (Solr 4.3.0, custom config, custom schema, etc.; I'm not a newbie and understand Solr internals pretty well).
This was my ERH config:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="uprefix">ignored_</str>
<str name="fmap.a">ignored_</str>
<str name="fmap.div">ignored_</str>
<str name="fmap.content">text</str>
<str name="captureAttr">false</str>
<str name="lowernames">true</str>
<bool name="ignoreTikaException">true</bool>
</lst>
</requestHandler>
It was basically configured to ignore everything except the content (I believe that's reasonable for many people).
After careful investigation I found that
<str name="captureAttr">false</str>
was what caused the OP's issue. The example configuration has it turned on, but I turned it off because I did not need it anyway, and that was my mistake. I have no idea why, but turning it off causes Solr to put the extracted attributes into the fmap.content field together with the extracted text.
So the solution is to turn it back on.
Final ERH:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="uprefix">ignored_</str>
<str name="fmap.a">ignored_</str>
<str name="fmap.div">ignored_</str>
<str name="fmap.content">text</str>
<str name="captureAttr">true</str>
<str name="lowernames">true</str>
<bool name="ignoreTikaException">true</bool>
</lst>
</requestHandler>
Now only the extracted text is put into the fmap.content field.
Unfortunately, I have not found any documentation that explains this. It is either a bug or just stupid behavior.
Tika with Solr produces separate fields for the content and the metadata.
If you use the standard ExtractingRequestHandler:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
The content is mapped (via fmap.content) to the text field, which should contain only the content of your PDF.
The other metadata fields can easily be inspected by modifying schema.xml.
Mark the ignored field type as stored:
<fieldtype name="ignored" stored="true" indexed="false" multiValued="true" class="solr.StrField" />
Capture all fields:
<dynamicField name="*" type="ignored" multiValued="true" />
Tika adds a lot of metadata fields, with the content set separately. For example, this is the response after feeding a PPT to the extract handler:
<doc>
<arr name="application_name">
<str>Microsoft PowerPoint</str>
</arr>
<str name="category">POT - US</str>
<str name="comments">version 1.1</str>
<arr name="company">
<str>
</str>
</arr>
<arr name="content_type">
<str>application/vnd.ms-powerpoint</str>
</arr>
<arr name="creation_date">
<str>2000-03-15T16:57:27Z</str>
</arr>
<arr name="custom_delivery_date">
<str>
</str>
</arr>
<arr name="custom_docid">
<str>
</str>
</arr>
<arr name="custom_docidinslide">
<str>true</str>
</arr>
<arr name="custom_docidintitle">
<str>true</str>
</arr>
<arr name="custom_docidposition">
<str>0</str>
</arr>
<arr name="custom_event">
<str>
</str>
</arr>
<arr name="custom_final">
<str>false</str>
</arr>
<arr name="custom_mckpapersize">
<str>US</str>
</arr>
<arr name="custom_notespagelayout">
<str>Lower</str>
</arr>
<arr name="custom_title">
<str>Lower Universal Template US</str>
</arr>
<arr name="custom_universal_objects">
<str>true</str>
</arr>
<arr name="edit_time">
<str>284587970000</str>
</arr>
<str name="id">101</str>
<arr name="ignored_">
<str>slideShow</str>
<str>slide</str>
<str>slide</str>
<str>slideNotes</str>
</arr>
<str name="keywords">test</str>
<arr name="last_author">
<str>Corporate</str>
</arr>
<arr name="last_printed">
<str>2000-03-17T20:28:57Z</str>
</arr>
<arr name="last_save_date">
<str>2009-03-24T16:52:26Z</str>
</arr>
<arr name="manager">
<str>
</str>
</arr>
<arr name="meta">
<str>stream_source_info</str>
<str>file:/C:/temp/nuggets/100000.ppt</str>
<str>Last-Author</str>
<str>Corporate</str>
<str>Slide-Count</str>
<str>2</str>
<str>custom:DocIDPosition</str>
<str>0</str>
<str>Application-Name</str>
<str>Microsoft PowerPoint</str>
<str>custom:Delivery Date</str>
<str>
</str>
<str>custom:Event</str>
<str>
</str>
<str>Edit-Time</str>
<str>284587970000</str>
<str>Word-Count</str>
<str>120</str>
<str>Creation-Date</str>
<str>2000-03-15T16:57:27Z</str>
<str>stream_size</str>
<str>181248</str>
<str>Manager</str>
<str>
</str>
<str>stream_name</str>
<str>100000.ppt</str>
<str>Company</str>
<str>
</str>
<str>Keywords</str>
<str>test</str>
<str>Last-Save-Date</str>
<str>2009-03-24T16:52:26Z</str>
<str>Revision-Number</str>
<str>91</str>
<str>Last-Printed</str>
<str>2000-03-17T20:28:57Z</str>
<str>Comments</str>
<str>version 1.1</str>
<str>Template</str>
<str>
</str>
<str>custom:PaperSize</str>
<str>US</str>
<str>custom:DocID</str>
<str>
</str>
<str>xmpTPg:NPages</str>
<str>2</str>
<str>custom:NotesPageLayout</str>
<str>Lower</str>
<str>custom:DocIDinSlide</str>
<str>true</str>
<str>Category</str>
<str>POT - US</str>
<str>custom:Universal Objects</str>
<str>true</str>
<str>custom:Final</str>
<str>false</str>
<str>custom:DocIDinTitle</str>
<str>true</str>
<str>Content-Type</str>
<str>application/vnd.ms-powerpoint</str>
<str>custom:Title</str>
<str>test</str>
</arr>
<arr name="p">
<str>slide-content</str>
<str>slide-content</str>
</arr>
<arr name="revision_number">
<str>91</str>
</arr>
<arr name="slide_count">
<str>2</str>
</arr>
<arr name="stream_name">
<str>100000.ppt</str>
</arr>
<arr name="stream_size">
<str>181248</str>
</arr>
<arr name="stream_source_info">
<str>file:/C:/temp/test/100000.ppt</str>
</arr>
<arr name="template">
<str>
</str>
</arr>
<!-- Content field -->
<arr name="text">
<str>test Test test test test tes t</str>
</arr>
<arr name="title">
<str>test</str>
</arr>
<arr name="word_count">
<str>120</str>
</arr>
<arr name="xmptpg_npages">
<str>2</str>
</arr>
</doc>
I no longer have the problem I described above. Since asking the question, I have updated to Solr 4.0 alpha and recreated schema.xml from the Solr Cell example that ships with the 4.0a package. I suspect my original schema was copying the metadata fields' content to the text field, so it was most likely my own error.
In solrconfig.xml, where the request handler is defined, add this line:
<str name="fmap.title">ignored_</str>
This tells Tika to simply ignore the title attribute (or whichever attributes you want ignored) that it finds embedded in the PDF.
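Mapping a field to ignored_ only discards it if the schema routes ignored_* names to a field type that is neither indexed nor stored. The stock example schema does this with definitions along the following lines; whether your own schema already contains them is an assumption worth checking:

```xml
<!-- Field type that silently drops whatever is sent to it -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
<!-- Any field whose name starts with ignored_ uses that type -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```

If the type is marked stored="true" instead, the mapped values are kept and returned with the document rather than dropped.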
In my case, <str name="xpath">/xhtml:html/xhtml:body//node()</str> allowed extraction of content without the meta.
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">content</str>
<!-- Specify where content should be extracted exactly -->
<str name="xpath">/xhtml:html/xhtml:body//node()</str>
</lst>
</requestHandler>

Solr spellcheck configuration

I am trying to build the spellcheck index with IndexBasedSpellChecker:
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">text</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
And I want to specify the dynamic field "*_text" as the field option:
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
How can this be done?
Copy all the text fields to one field:
<copyField source="*_text" dest="textSpell" />
and then build the spellcheck index from the "textSpell" field:
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">textSpell</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
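The copyField destination has to exist in the schema. The answer does not show it, so the definition below is an assumption: it reuses the text type from the question's dynamic field and is multiValued because several *_text fields are copied into it.

```xml
<!-- Assumed definition of the copyField target; adjust the type to your schema -->
<field name="textSpell" type="text" indexed="true" stored="false" multiValued="true"/>
```

After adding the field, reindex and rebuild the spellcheck index (for example with spellcheck.build=true on a request) so that the dictionary picks up the copied terms.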
These will be helpful: Implementation of solr spellchecker and spellCheckComponent
