Mapping fields in SOLR for faceting - solr

I'm indexing rich text documents into SOLR 3.4 using ExtractingRequestHandler and I'm having trouble getting it to behave like I want it to.
I would like to store creation date as a field to use for faceted search later and have defined the following in schema.xml:
<field name="creation_date" type="date" indexed="true" stored="true"/>
I index like this:
curl -s "http://localhost:8983/solr/update/extract?literal.id=myid&resource.name=myfile.xls&commit=true" -F myfile=#/path/to/myfile.xls
I get the dynamic field attr_creation_date (that other rules make sure), but I don't get it as creation_date. I have also unsuccessfully tried to use copyField like so:
<copyField source="attr_creation_date" dest="creation_date"/>
Yet another try was putting this in solrconfig.xml, but no luck:
<str name="fmap.Creation-Date">creation_date</str>
I'm pretty sure I'm missing something basic here. Any help is most appreciated!
Settings for ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="fmap.Last-Save-Date">last_save_date</str>
<str name="fmap.Creation-Date">creation_date</str>
<str name="fmap.Content-Type">content_type</str>
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
</lst>
</requestHandler>
My schema.xml file (lots of default stuff): https://gist.github.com/1358002

Related

How to extract <h2> tag information from html file while indexing them in Solr

I want to extract the <h2> tag information from html file while indexing them in Solr.
For example: In test.htm file i have content like <h2>This is for test</h2>
I need to extract This is for test in h2 index.
I found under conf/managed-schema file already have 'h1' field defined which extract the information from <h1> tag from html which is working fine.
Defined as: <field name="h1" type="text_general" indexed="true" stored="true"/>
So, i want to do same for <h2> tag which is not working.
I tried : <field name="h2" type="text_general" indexed="true" stored="true"/>
I am indexing the test.htm file by running command: /var/www/html/solr-5.3.1/bin/post -p 9000 -c Core -filetypes htm,html /var/www/html/test/Core/test.htm
I am stuck with this...can anybody please help me out?
Finally, after done lots of R&D I get the solution :-).
I added <str name="capture">h2</str>
<str name="fmap.h2">h2</str> into the solrconfig.xml and it start working.
So, my final solrconfig.xml looks like:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
<str name="capture">h1</str>
<str name="fmap.h1">h1</str>
<str name="capture">h2</str>
<str name="fmap.h2">h2</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>
That's it :-)
Might be some other user can face same problem in future so, i am posting this as answer.

Solr 7.x "/export" response handler not working with Streaming Expressions

I am doing a Solr streaming expression, and I try to use the /export handler to fetch all of the results. Consider the following query:
search(main, q=*:*, fl="SSRN",qt="/export",sort="SSRN asc")
I configured my schema.xml for the SSRN field as follows:
<field name="SSRN" type="int" indexed="true" stored="true" required="false" multiValued="false" docValues="true" />
Since the SSRN field is a docValue, it should work. The results are just the standard 10 documents. This is running in a SolrCloud environment with just one node and one shard.
Thanks in advance!
I fixed the issue. It seems that in SOLR-8426: Enable /export, /stream and /sql handlers by default and remove them from example configs, they removed the need to add /export handler to the solrconfig.xml. If you do add it, then it doesn't work. The solution is just to remove this code (from solrconfig.xml):
<requestHandler name="/export" class="solr.SearchHandler">
<lst name="invariants">
<str name="rq">{!xport}</str>
<str name="wt">xsort</str>
<str name="distrib">false</str>
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler>

How to configure multiple contextfields in single solr suggester?

I am using apache solr to search records in my current application.
And I was able to filter the suggesions based on DocumentType by configuring the context field.
Now I want to add another context field like departmentType. I am not sure how to configure the suggester for multiple context fields.
This is the suggester that used with single context fields and this is working fine.
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggesterByName</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">fullName</str>
<str name="contextField">documentType</str>
<str name="suggestAnalyzerFieldType">text_general</str>
<str name="buildOnStartup">false</str>
</lst>
</searchComponent>
I refer this post
https://issues.apache.org/jira/browse/SOLR-7888
but still not clear how to configure multiple context fields in a single suggester .
You have to create a new field in your schema.xml as context_field.
This field should have multivalued=true
<field name="context_field" type="text_suggest" multiValued="true" indexed="true" stored="true"/>
Then you have to create this context_field as a list in json for indexing in solr.
"context_field" : ["some document type", "some department type"]
after indexing you can suggest like this-
suggest.q=b&suggest.cfq=context_documentType AND context_departmentType
Hope it works

Indexing issue in Nutch-Solr (Want to display data in solr)

I am trying to crawl data using Nutch and Index that Data in Solr.
I have follow the steps from this Url Using Nutch with Solr and Nutch Wiki Tutorial
I've successfully Index data using Solrindex command
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/* but in Result I can't find the Indexed data.
I want result as below Image
But I can't see any result data at right side.
If you want some data to be returned with the search response, check that the targeted fields are stored by solr, then you can set a list of fields to return in your query using fl param (with stored field name as value). You can also set default fl values in solrconfig.xml.
For example, let's say you want content field to be returned. In your schema.xml, in the <fields> declaration you should have the option stored="true" for this field like so :
<field name="content" type="text" indexed="true" stored="true"/>
Then in solrconfig.xml, declare default fl params in the requestHandler definition, you can set specific fields (space separated field names). The xml sample (grabbed from the tutorial) should look like this if we just want data stored in the content field to be returned.
<requestHandler name="/nutch" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
content^0.5 anchor^1.0 title^1.2
</str>
<str name="pf">
content^0.5 anchor^1.5 title^1.2 site^1.5
</str>
<str name="fl">
url content
</str>
<str name="mm">
2<-1 5<-2 6<90%
</str>
<int name="ps">100</int>
<bool hl="true"/>
<str name="q.alt">*:*</str>
<str name="hl.fl">title url content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
</lst>
</requestHandler>
You can override these defaults right in the query. A common use case is to put "*,score" in the fl area in solr query interface so that you can see all stored fields (using wildcard character *) along with the score in the results. You might also want to specify the query type parameter (qt) according to the targeted request handler (should be "/nutch").
Helpful links :
http://wiki.apache.org/solr/SchemaXml#Common_field_options
http://wiki.apache.org/solr/CommonQueryParameters#fl

Basic UIMA with SOLR

I am trying to connect UIMA with Solr. I have downloaded the Solr 3.5 dist and have it successfully running with nutch and tika on windows 7 using solrcell and curl via cygwin.
To begin, I copied the 6 jars from solr/contrib/uima/lib to the working /lib in solr.
Next, I read the readme.txt file in solr/contrib/uima/lib and edited both my solrconfig.xml and schema.xml to no avail.
I then found this link which seemed a bit more applicable since I didnt care to use Alchemy or OpenCalais: http://code.google.com/a/apache-extras.org/p/rondhuit-uima/?redir=1
Still- when I run a curl command that imports a pdf via solrcell I do not get the additional UIMA fields nor do I get anything on my logs. The test.pdf is parsed though and I see the pdf in Solr using:
curl 'http://localhost:8080/solr/update/extract?fmap.content=content&literal.id=doc1&commit=true' -F "file=#test.pdf"
SolrConfig.XML
<updateRequestProcessorChain name="uima">
<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
<lst name="uimaConfig">
<lst name="runtimeParameters">
<str name="host">http://localhost</str>
<str name="port">8080</str>
</lst>
<str name="analysisEngine">C:\uima\desc\com\rondhuit\uima\desc\NextAnnotatorDescriptor.xml</str>
<bool name="ignoreErrors">true</bool>
<str name="logField">id</str>
<lst name="analyzeFields">
<bool name="merge">false</bool>
<arr name="fields">
<str>content</str>
</arr>
</lst>
<lst name="fieldMappings">
<lst name="type">
<str name="name">com.rondhuit.uima.next.NamedEntity</str>
<lst name="mapping">
<str name="feature">entity</str>
<str name="fieldNameFeature">uname</str>
<str name="dynamicField">*_sm</str>
</lst>
</lst>
</lst>
</lst>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update/uima" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">uima</str>
</lst>
</requestHandler>
AND I ALSO ADJUSTED MY requestHander:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.processor">uima</str>
</lst>
</requestHandler>
Schema.XML
<!-- fields for UIMA -->
<field name="uname" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
<dynamicField name="*_sm" type="string" indexed="true" stored="true"/>
All I am trying to do is have UIMA pull out names from text (just to start as a demo) and cannot figure out what I am doing wrong.
Thank you in advance for reading this.
Not sure if this ever got addressed, but in case someone else is looking, I had this same problem yesterday. Figured out that I was calling /update/extract to use solrcell, which doesn't use uima because it's integrated into /update.

Resources