textual content without metadata from Tika via SolrCell


Using Solr 3.6 and the ExtractingRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field, minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata mixed in with the text content of the document.
I would like to provide snippet highlighting of the content, and the subject metadata within the content field is skewing the highlight results.
UPDATE: In the Tika output as indexed by Solr, the metadata is prepended as a block of text to the PDF content.
The ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
</lst>
</requestHandler>
Schema.xml fields. Note "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when a doc is posted to the handler.
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

As all other answers are completely irrelevant, I'll post mine:
I have experienced exactly the same problem the OP describes (Solr 4.3.0, custom config, custom schema, etc.). I'm not a newbie and understand Solr internals pretty well.
This was my ERH config:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="uprefix">ignored_</str>
<str name="fmap.a">ignored_</str>
<str name="fmap.div">ignored_</str>
<str name="fmap.content">text</str>
<str name="captureAttr">false</str>
<str name="lowernames">true</str>
<bool name="ignoreTikaException">true</bool>
</lst>
</requestHandler>
It was basically configured to ignore everything except the content (I believe that's reasonable for many people).
After careful investigation I found that
<str name="captureAttr">false</str>
was what caused the OP's issue. The stock example configuration turns it on, but I had turned it off because I did not need it. That was my mistake. I have no idea why, but setting it to false causes Solr to put the extracted attributes into the fmap.content field together with the extracted text.
So the solution is to turn it back on.
Final ERH:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="uprefix">ignored_</str>
<str name="fmap.a">ignored_</str>
<str name="fmap.div">ignored_</str>
<str name="fmap.content">text</str>
<str name="captureAttr">true</str>
<str name="lowernames">true</str>
<bool name="ignoreTikaException">true</bool>
</lst>
</requestHandler>
Now only the extracted text is put into the fmap.content field.
Unfortunately I have not found any documentation that explains this. Either a bug or just stupid behavior.

Tika with Solr produces different fields for the content and the metadata.
If you use the standard ExtractingRequestHandler:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
The content is mapped (via fmap.content) to the text field, which should contain only the content of your PDF.
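As the comment in the handler config notes, highlighting needs a stored field. A minimal sketch of a matching schema.xml entry follows; the field type text_general is borrowed from the schema in the question and is an assumption about your setup, not a requirement.

```xml
<!-- Sketch only: the field that receives fmap.content.
     stored="true" lets Solr return and highlight the extracted text.
     The type name text_general is assumed from the question's schema. -->
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
```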
The other metadata fields can easily be inspected by modifying schema.xml.
Mark the ignored field type as stored:
<fieldtype name="ignored" stored="true" indexed="false" multiValued="true" class="solr.StrField" />
Capture all fields:
<dynamicField name="*" type="ignored" multiValued="true" />
Tika adds a lot of fields for the metadata, with the content set separately. For example, this is the response after feeding a PPT to the extract handler:
<doc>
<arr name="application_name">
<str>Microsoft PowerPoint</str>
</arr>
<str name="category">POT - US</str>
<str name="comments">version 1.1</str>
<arr name="company">
<str>
</str>
</arr>
<arr name="content_type">
<str>application/vnd.ms-powerpoint</str>
</arr>
<arr name="creation_date">
<str>2000-03-15T16:57:27Z</str>
</arr>
<arr name="custom_delivery_date">
<str>
</str>
</arr>
<arr name="custom_docid">
<str>
</str>
</arr>
<arr name="custom_docidinslide">
<str>true</str>
</arr>
<arr name="custom_docidintitle">
<str>true</str>
</arr>
<arr name="custom_docidposition">
<str>0</str>
</arr>
<arr name="custom_event">
<str>
</str>
</arr>
<arr name="custom_final">
<str>false</str>
</arr>
<arr name="custom_mckpapersize">
<str>US</str>
</arr>
<arr name="custom_notespagelayout">
<str>Lower</str>
</arr>
<arr name="custom_title">
<str>Lower Universal Template US</str>
</arr>
<arr name="custom_universal_objects">
<str>true</str>
</arr>
<arr name="edit_time">
<str>284587970000</str>
</arr>
<str name="id">101</str>
<arr name="ignored_">
<str>slideShow</str>
<str>slide</str>
<str>slide</str>
<str>slideNotes</str>
</arr>
<str name="keywords">test</str>
<arr name="last_author">
<str>Corporate</str>
</arr>
<arr name="last_printed">
<str>2000-03-17T20:28:57Z</str>
</arr>
<arr name="last_save_date">
<str>2009-03-24T16:52:26Z</str>
</arr>
<arr name="manager">
<str>
</str>
</arr>
<arr name="meta">
<str>stream_source_info</str>
<str>file:/C:/temp/nuggets/100000.ppt</str>
<str>Last-Author</str>
<str>Corporate</str>
<str>Slide-Count</str>
<str>2</str>
<str>custom:DocIDPosition</str>
<str>0</str>
<str>Application-Name</str>
<str>Microsoft PowerPoint</str>
<str>custom:Delivery Date</str>
<str>
</str>
<str>custom:Event</str>
<str>
</str>
<str>Edit-Time</str>
<str>284587970000</str>
<str>Word-Count</str>
<str>120</str>
<str>Creation-Date</str>
<str>2000-03-15T16:57:27Z</str>
<str>stream_size</str>
<str>181248</str>
<str>Manager</str>
<str>
</str>
<str>stream_name</str>
<str>100000.ppt</str>
<str>Company</str>
<str>
</str>
<str>Keywords</str>
<str>test</str>
<str>Last-Save-Date</str>
<str>2009-03-24T16:52:26Z</str>
<str>Revision-Number</str>
<str>91</str>
<str>Last-Printed</str>
<str>2000-03-17T20:28:57Z</str>
<str>Comments</str>
<str>version 1.1</str>
<str>Template</str>
<str>
</str>
<str>custom:PaperSize</str>
<str>US</str>
<str>custom:DocID</str>
<str>
</str>
<str>xmpTPg:NPages</str>
<str>2</str>
<str>custom:NotesPageLayout</str>
<str>Lower</str>
<str>custom:DocIDinSlide</str>
<str>true</str>
<str>Category</str>
<str>POT - US</str>
<str>custom:Universal Objects</str>
<str>true</str>
<str>custom:Final</str>
<str>false</str>
<str>custom:DocIDinTitle</str>
<str>true</str>
<str>Content-Type</str>
<str>application/vnd.ms-powerpoint</str>
<str>custom:Title</str>
<str>test</str>
</arr>
<arr name="p">
<str>slide-content</str>
<str>slide-content</str>
</arr>
<arr name="revision_number">
<str>91</str>
</arr>
<arr name="slide_count">
<str>2</str>
</arr>
<arr name="stream_name">
<str>100000.ppt</str>
</arr>
<arr name="stream_size">
<str>181248</str>
</arr>
<arr name="stream_source_info">
<str>file:/C:/temp/test/100000.ppt</str>
</arr>
<arr name="template">
<str>
</str>
</arr>
<!-- Content field -->
<arr name="text">
<str>test Test test test test tes t</str>
</arr>
<arr name="title">
<str>test</str>
</arr>
<arr name="word_count">
<str>120</str>
</arr>
<arr name="xmptpg_npages">
<str>2</str>
</arr>
</doc>

I no longer have the problem I described above. Since asking the question, I have updated to Solr 4.0 alpha and recreated schema.xml from the Solr Cell example that ships with the 4.0a package. I suspect my original schema was copying the metadata fields' content to the text field, so it was most likely my own error.
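For illustration, copyField rules like the following would produce the symptom described in the question. The field names follow the schema shown in the question; the rules themselves are an assumption about what the original schema might have contained, not a copy of it.

```xml
<!-- Hypothetical schema.xml fragment (not taken from the original schema):
     each rule appends a metadata field's value to "content", so the
     metadata ends up in the field used for highlighting. -->
<copyField source="title" dest="content"/>
<copyField source="subject" dest="content"/>
```

Removing such rules, or pointing them at a separate catch-all field that is not highlighted, would keep the highlighted field limited to the body text.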

In solrconfig.xml, where the request handler is defined, add this line:
<str name="fmap.title">ignored_</str>
This tells Solr to simply ignore the title attribute (or whichever attributes you want ignored) that Tika finds embedded within the PDF.
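Mapping a field to ignored_ only discards it if the schema has a matching field that is neither indexed nor stored. A sketch of such a definition, modeled on the example schema that ships with Solr (check your own schema.xml, since the names may differ):

```xml
<!-- Sketch: a field type that neither indexes nor stores its values -->
<fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField"/>
<!-- Any field whose name starts with ignored_ is silently dropped -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```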

In my case, <str name="xpath">/xhtml:html/xhtml:body//node()</str> allowed extraction of content without the meta.
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">content</str>
<!-- Specify where content should be extracted exactly -->
<str name="xpath">/xhtml:html/xhtml:body//node()</str>
</lst>
</requestHandler>

Related

Couldn't get data in suggester even when storeDir getting created by FileDictionaryFactory

This is a follow-up to an earlier question. I have a list of cities for which I want to implement a spell-checker, and I have the priorities/weights of these cities. I tried implementing a Solr suggester backed by FileDictionaryFactory, using the following file format:
<city-name> <TAB> <weight> <TAB> <other parameters like citycode,country>
I am passing other attributes like citycode, country etc as pipe separated payload string.
Here's my solrconfig
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">FileDictionaryFactory</str>
<str name="field">name</str>
<str name="weightField">searchscore</str>
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">false</str>
<str name="sourceLocation">spellings.txt</str>
<str name="storeDir">autosuggest_dict</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">mySuggester</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
and my schema
<field name="name" type="string" indexed="true" stored="true" multiValued="false" />
<field name="countrycode" type="string" indexed="true" stored="true" multiValued="false" />
<field name="latlng" type="location" indexed="true" stored="true" multiValued="false" />
<field name="searchfield" type="text_ngram" indexed="true" stored="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />
<uniqueKey>id</uniqueKey>
<defaultSearchField>searchfield</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<copyField source="name" dest="searchfield"/>
Now the problem I am facing is that I get 0 results for every search query, even though I can see the storeDir being created, and it contains a bin file with data that looks like my payload data.
This is the URL format I am using:
/suggest?suggest=true&suggest.dictionary=mySuggester&wt=json&suggest.q=cologne
So, I have the following questions:
What does the creation of storeDir signify? Was the data indexed successfully?
If yes, what's wrong with my query? If no, am I missing something here (indexPath?)?
Is this the right way to supply search parameters in the payload field? If not, is there another way?

There is a slight change needed in your solrconfig.xml: remove buildOnStartup from the suggester configuration or set it to true.
[solrconfig.xml]
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">FileDictionaryFactory</str>
<str name="field">name</str>
<str name="weightField">searchscore</str>
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">true</str>
<str name="sourceLocation">spellings.txt</str>
<str name="storeDir">autosuggest_dict</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">mySuggester</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
The problem with the file-based suggester is that it does not build its suggestions at query time just because suggest=true is set. You need to build the file-based suggester on startup.

I was using searchfield as defaultSearchField in schema, but had configured name as suggest field. The moment I changed field to searchfield and suggestAnalyzerFieldType to text_ngram, it started working.
Here is the working solrconfig:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggestions</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">FileDictionaryFactory</str>
<str name="field">searchfield</str>
<str name="weightField">searchscore</str>
<str name="suggestAnalyzerFieldType">text_ngram</str>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">false</str>
<str name="sourceLocation">spellings.txt</str>
<str name="storeDir">autosuggest_dict</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">suggestions</str>
<str name="suggest.dictionary">results</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>

Solr Spellcheck request returns nothing

I am currently using Solr 4.8.1 and have set up spellcheck. After indexing, the request doesn't return any suggestions.
Following the advice of @n0tting, I modified my files a little.
Here are the steps:
1- solrconfig.xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">phraseText</str>
<lst name="spellchecker">
<str name="classname">solr.IndexBasedSpellChecker</str>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="name">default</str>
<str name="field">title_spellcheck</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
Add some configuration to the standard requestHandler:
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<!-- default values for query parameters -->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<!-- Optional, must match spell checker's name as defined above, defaults to "default" -->
<str name="spellcheck.dictionary">default</str>
<!-- omp = Only More Popular -->
<str name="spellcheck.onlyMorePopular">false</str>
<!-- exr = Extended Results -->
<str name="spellcheck.extendedResults">false</str>
<!-- The number of suggestions to return -->
<str name="spellcheck.count">1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
2- schema.xml
Define a field for spell check:
<field name="title_spellcheck" type="phraseText" indexed="true" stored="false" multiValued="true" />
<copyField source="title" dest="title_spellcheck"/>
3- Request:
.../select?q=recommend&defType=edismax&qf=title&spellcheck=true&spellcheck.build=true&spellcheck.q=recommend&spellcheck.collate=true
I don't get any suggestions in the result, not even <lst name="spellcheck">. Can anybody give me some advice? Thanks a lot.
References:
https://cwiki.apache.org/confluence/display/solr/Spell+Checking
http://solr.pl/en/2011/05/23/%E2%80%9Ccar-sale-application%E2%80%9D-%E2%80%93-spellcheckcomponent-%E2%80%93-did-you-really-mean-that-part-5/

Solr autocomplete on words not in beginning of string

I have some docs in Solr that contains information about books. One of the fields is author, defined as:
<field name="author" type="text_general" indexed="true" stored="true"/>
This is an example of a doc:
<doc>
<str name="id">db04</str>
<str name="isbn">0596529325</str>
<str name="author">Toby Segaran</str>
<str name="category">Computers/Programming/Information Retrieval/Machine Learning</str>
<arr name="title">
<str>Programming Collective Intelligence</str>
</arr>
<int name="yearpub">2007</int>
<date name="pubdate">2007-07-28T00:00:01Z</date>
</doc>
I'm trying to create an autocomplete system using Solr 4.2. So far it works well: if I search for "to", it returns "Toby Segaran" as the result.
But on our website many people search for "Segaran", for instance, and I was wondering whether it is possible to suggest "Toby Segaran" when this happens.
So far this is the schema.xml I'm using:
<field name="author_suggest" type="text_auto" indexed="true" stored="true" multiValued="false"/>
<copyField source="author" dest="author_suggest"/>
<fieldType class="solr.TextField" name="text_auto">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Basically the field author is processed and copied to author_suggest.
In solrconfig.xml, these were created:
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookupFactory</str>
<str name="field">author_suggest</str>
<float name="threshold">0.005</float>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.count">6</str>
<str name="spellcheck.collate">false</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
So is it possible to make suggestions based on words that are not at the beginning of the phrase, using Solr's suggester?
If you need more information please let me know.
Thanks in advance

Basic UIMA with SOLR

I am trying to connect UIMA with Solr. I have downloaded the Solr 3.5 dist and have it running successfully with Nutch and Tika on Windows 7, using Solr Cell and curl via Cygwin.
To begin, I copied the 6 jars from solr/contrib/uima/lib to the working /lib in solr.
Next, I read the readme.txt file in solr/contrib/uima/lib and edited both my solrconfig.xml and schema.xml to no avail.
I then found this link, which seemed a bit more applicable since I didn't care to use Alchemy or OpenCalais: http://code.google.com/a/apache-extras.org/p/rondhuit-uima/?redir=1
Still, when I run a curl command that imports a PDF via Solr Cell, I do not get the additional UIMA fields, nor do I get anything in my logs. The test.pdf is parsed, though, and I see the PDF in Solr using:
curl 'http://localhost:8080/solr/update/extract?fmap.content=content&literal.id=doc1&commit=true' -F "file=@test.pdf"
solrconfig.xml
<updateRequestProcessorChain name="uima">
<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
<lst name="uimaConfig">
<lst name="runtimeParameters">
<str name="host">http://localhost</str>
<str name="port">8080</str>
</lst>
<str name="analysisEngine">C:\uima\desc\com\rondhuit\uima\desc\NextAnnotatorDescriptor.xml</str>
<bool name="ignoreErrors">true</bool>
<str name="logField">id</str>
<lst name="analyzeFields">
<bool name="merge">false</bool>
<arr name="fields">
<str>content</str>
</arr>
</lst>
<lst name="fieldMappings">
<lst name="type">
<str name="name">com.rondhuit.uima.next.NamedEntity</str>
<lst name="mapping">
<str name="feature">entity</str>
<str name="fieldNameFeature">uname</str>
<str name="dynamicField">*_sm</str>
</lst>
</lst>
</lst>
</lst>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update/uima" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">uima</str>
</lst>
</requestHandler>
I also adjusted my /update requestHandler:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.processor">uima</str>
</lst>
</requestHandler>
schema.xml
<!-- fields for UIMA -->
<field name="uname" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
<dynamicField name="*_sm" type="string" indexed="true" stored="true"/>
All I am trying to do is have UIMA pull names out of the text (just as a starting demo), and I cannot figure out what I am doing wrong.
Thank you in advance for reading this.

Not sure if this ever got addressed, but in case someone else is looking: I had this same problem yesterday. I figured out that I was calling /update/extract to use Solr Cell, which doesn't run UIMA because the UIMA chain is wired into /update.
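One way to act on this, sketched under assumptions: point the extract handler at the same chain through the update.chain default. The chain name uima and the content mapping come from the question's config; whether this resolves the issue on a given Solr 3.5 install is untested here.

```xml
<!-- Sketch: route Solr Cell uploads through the "uima" chain
     defined in the question's solrconfig.xml -->
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">content</str>
    <str name="update.chain">uima</str>
  </lst>
</requestHandler>
```

With this in place the curl command from the question would not need to change, since the chain is picked up from the handler defaults.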

Solr highlighting with unexpected prefix and suffix

I need to customize Solr highlighting prefix and suffix like this:
<span class="highlight">text</span>
instead of the default
<em>text</em>
That's why I'm using this configuration within the solrconfig.xml for the HighlightComponent:
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
<fragmentsBuilder name="simple" default="true" class="solr.highlight.SimpleFragmentsBuilder">
<lst name="defaults">
<str name="hl.tag.pre"><![CDATA[<span class="highlight">]]></str>
<str name="hl.tag.post"><![CDATA[</span>]]></str>
</lst>
</fragmentsBuilder>
</highlighting>
</searchComponent>
The following are the default parameters for my standard request handler:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="hl">true</str>
<str name="hl.fl">body,title</str>
<str name="hl.useFastVectorHighlighter">true</str>
</lst>
</requestHandler>
When I search for the text word I do get the text word highlighted, but not always using the prefix and suffix I configured:
<lst name="highlighting">
<lst name="document_1">
<arr name="body">
<str>my <em>text</em> highlighted</str>
</arr>
<arr name="title">
<str>my <span class="highlight">text</span> highlighted</str>
</arr>
</lst>
</lst>
Does anybody know why?

I am guessing you are seeing this behavior because you only have the prefix and suffix defined for the SimpleFragmentsBuilder, and the other highlights are coming from another fragment builder.
I am using a custom prefix and suffix for my highlighting. I set them in the formatter section of the highlighting section of solrconfig.xml and have not had any issues, as it will apply to all fragment builders.
So maybe try the following:
<highlighting>
<fragmentsBuilder name="simple" default="true"
class="solr.highlight.SimpleFragmentsBuilder"/>
<!-- Configure the standard formatter -->
<formatter name="html" class="org.apache.solr.highlight.HtmlFormatter"
default="true">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<span class="highlight">]]></str>
<str name="hl.simple.post"><![CDATA[</span>]]></str>
</lst>
</formatter>
</highlighting>

I finally found out why! I'm using fastVectorHighlighter to make highlighting faster.
At the beginning I was highlighting only the title field and everything worked fine.
When I added the body field to highlighting I forgot to enable termVectors=true.
Now that my body field looks like this
<field name="body" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
after a full reindex highlighting is working perfectly:
<lst name="highlighting">
<lst name="document_1">
<arr name="body">
<str>my <span class="highlight">text</span> highlighted</str>
</arr>
<arr name="title">
<str>my <span class="highlight">text</span> highlighted</str>
</arr>
</lst>
</lst>
Previously the body field highlighting did work, but without fastVectorHighlighter since the field didn't have the termVectors=true parameter. That's why I got body highlighted with default prefix and suffix. Since fastVectorHighlighter is a completely different highlighting method, the configuration is different as well.
To avoid this kind of mistake, as long as users can choose which fields to highlight with the hl.fl parameter, I'd recommend also including the configuration for the standard highlighter (formatter element, class solr.highlight.HtmlFormatter), like this:
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
<formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<span class="highlight">]]></str>
<str name="hl.simple.post"><![CDATA[</span>]]></str>
</lst>
</formatter>
<fragmentsBuilder name="simple" default="true" class="solr.highlight.SimpleFragmentsBuilder">
<lst name="defaults">
<str name="hl.tag.pre"><![CDATA[<span class="highlight">]]></str>
<str name="hl.tag.post"><![CDATA[</span>]]></str>
</lst>
</fragmentsBuilder>
</highlighting>
</searchComponent>
This way highlighting will work with the same prefix and suffix even for fields with termVectors disabled.
