I am trying to connect UIMA with Solr. I have downloaded the Solr 3.5 dist and have it successfully running with nutch and tika on windows 7 using solrcell and curl via cygwin.
To begin, I copied the 6 jars from solr/contrib/uima/lib to the working /lib in solr.
Next, I read the readme.txt file in solr/contrib/uima/lib and edited both my solrconfig.xml and schema.xml to no avail.
I then found this link which seemed a bit more applicable since I didnt care to use Alchemy or OpenCalais: http://code.google.com/a/apache-extras.org/p/rondhuit-uima/?redir=1
Still- when I run a curl command that imports a pdf via solrcell I do not get the additional UIMA fields nor do I get anything on my logs. The test.pdf is parsed though and I see the pdf in Solr using:
curl 'http://localhost:8080/solr/update/extract?fmap.content=content&literal.id=doc1&commit=true' -F "file=#test.pdf"
SolrConfig.XML
<updateRequestProcessorChain name="uima">
<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
<lst name="uimaConfig">
<lst name="runtimeParameters">
<str name="host">http://localhost</str>
<str name="port">8080</str>
</lst>
<str name="analysisEngine">C:\uima\desc\com\rondhuit\uima\desc\NextAnnotatorDescriptor.xml</str>
<bool name="ignoreErrors">true</bool>
<str name="logField">id</str>
<lst name="analyzeFields">
<bool name="merge">false</bool>
<arr name="fields">
<str>content</str>
</arr>
</lst>
<lst name="fieldMappings">
<lst name="type">
<str name="name">com.rondhuit.uima.next.NamedEntity</str>
<lst name="mapping">
<str name="feature">entity</str>
<str name="fieldNameFeature">uname</str>
<str name="dynamicField">*_sm</str>
</lst>
</lst>
</lst>
</lst>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update/uima" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">uima</str>
</lst>
</requestHandler>
AND I ALSO ADJUSTED MY requestHander:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.processor">uima</str>
</lst>
</requestHandler>
Schema.XML
<!-- fields for UIMA -->
<field name="uname" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
<dynamicField name="*_sm" type="string" indexed="true" stored="true"/>
All I am trying to do is have UIMA pull out names from text (just to start as a demo) and cannot figure out what I am doing wrong.
Thank you in advance for reading this.
Not sure if this ever got addressed, but in case someone else is looking, I had this same problem yesterday. Figured out that I was calling /update/extract to use solrcell, which doesn't use uima because it's integrated into /update.
Related
I am doing a Solr streaming expression, and I try to use the /export handler to fetch all of the results. Consider the following query:
search(main, q=*:*, fl="SSRN",qt="/export",sort="SSRN asc")
I configured my schema.xml for the SSRN field as follows:
<field name="SSRN" type="int" indexed="true" stored="true" required="false" multiValued="false" docValues="true" />
Since the SSRN field is a docValue, it should work. The results are just the standard 10 documents. This is running in a SolrCloud environment with just one node and one shard.
Thanks in advance!
I fixed the issue. It seems that in SOLR-8426: Enable /export, /stream and /sql handlers by default and remove them from example configs, they removed the need to add /export handler to the solrconfig.xml. If you do add it, then it doesn't work. The solution is just to remove this code (from solrconfig.xml):
<requestHandler name="/export" class="solr.SearchHandler">
<lst name="invariants">
<str name="rq">{!xport}</str>
<str name="wt">xsort</str>
<str name="distrib">false</str>
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler>
Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text content of the document.
I would like to provide some snippet highlighting of the content and the subject metadata within the content field is skewing the highlight results.
UPDATE: Screenshot of Tika output as indexed by Solr. Highlighted portion is the block of metadata that gets prepended as a block of text to the PDF content.
The ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
</lst>
</requestHandler>
Schema.xml fields. Note "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when a doc is posted to the handler.
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
As all other answers are completely irrelevant, I'll post mine:
I have experienced exactly the same problem as OP describes, (Solr 4.3.0, custom config, custom schema, etc. I'm not newbie or something and understand Solr internals pretty well)
This was my ERH config:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="uprefix">ignored_</str>
<str name="fmap.a">ignored_</str>
<str name="fmap.div">ignored_</str>
<str name="fmap.content">text</str>
<str name="captureAttr">false</str>
<str name="lowernames">true</str>
<bool name="ignoreTikaException">true</bool>
</lst>
</requestHandler>
It was basically configured to ignore everything except the content (i believe it's reasonable for many people).
After careful investigation i found out, that
<str name="captureAttr">false</str>
was the thing caused OP's issue. By default it is turned on, but i turned it off as i did not need it anyway. And that was my mistake. I have no idea why, but it causes Solr to put extracted attributes into fmap.content field altogether with extracted text.
So the solution is to turn it back on.
Final ERH:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="uprefix">ignored_</str>
<str name="fmap.a">ignored_</str>
<str name="fmap.div">ignored_</str>
<str name="fmap.content">text</str>
<str name="captureAttr">true</str>
<str name="lowernames">true</str>
<bool name="ignoreTikaException">true</bool>
</lst>
</requestHandler>
Now, only extracted text is put to fmap.content field.
Unfortunately i have not found any piece of documentation which can explain this. Either bug or just stupid behavior
Tika with Solr produces different fields for the content and the metadata.
If you use the Standard ExtractingRequestHandler -
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
The field map content is set to text field which should be only the content of your pdf.
The other metadata fields can be easily checked by modifying the schema.xml.
mark stored true for igonred field type
<fieldtype name="ignored" stored="true" indexed="false" multiValued="true" class="solr.StrField" />
Capture all fields -
<dynamicField name="*" type="ignored" multiValued="true" />
Tika adds lot of fields for the metadata with the content being set separately e.g. response when fed extract handler with a ppt.
<doc>
<arr name="application_name">
<str>Microsoft PowerPoint</str>
</arr>
<str name="category">POT - US</str>
<str name="comments">version 1.1</str>
<arr name="company">
<str>
</str>
</arr>
<arr name="content_type">
<str>application/vnd.ms-powerpoint</str>
</arr>
<arr name="creation_date">
<str>2000-03-15T16:57:27Z</str>
</arr>
<arr name="custom_delivery_date">
<str>
</str>
</arr>
<arr name="custom_docid">
<str>
</str>
</arr>
<arr name="custom_docidinslide">
<str>true</str>
</arr>
<arr name="custom_docidintitle">
<str>true</str>
</arr>
<arr name="custom_docidposition">
<str>0</str>
</arr>
<arr name="custom_event">
<str>
</str>
</arr>
<arr name="custom_final">
<str>false</str>
</arr>
<arr name="custom_mckpapersize">
<str>US</str>
</arr>
<arr name="custom_notespagelayout">
<str>Lower</str>
</arr>
<arr name="custom_title">
<str>Lower Universal Template US</str>
</arr>
<arr name="custom_universal_objects">
<str>true</str>
</arr>
<arr name="edit_time">
<str>284587970000</str>
</arr>
<str name="id">101</str>
<arr name="ignored_">
<str>slideShow</str>
<str>slide</str>
<str>slide</str>
<str>slideNotes</str>
</arr>
<str name="keywords">test</str>
<arr name="last_author">
<str>Corporate</str>
</arr>
<arr name="last_printed">
<str>2000-03-17T20:28:57Z</str>
</arr>
<arr name="last_save_date">
<str>2009-03-24T16:52:26Z</str>
</arr>
<arr name="manager">
<str>
</str>
</arr>
<arr name="meta">
<str>stream_source_info</str>
<str>file:/C:/temp/nuggets/100000.ppt</str>
<str>Last-Author</str>
<str>Corporate</str>
<str>Slide-Count</str>
<str>2</str>
<str>custom:DocIDPosition</str>
<str>0</str>
<str>Application-Name</str>
<str>Microsoft PowerPoint</str>
<str>custom:Delivery Date</str>
<str>
</str>
<str>custom:Event</str>
<str>
</str>
<str>Edit-Time</str>
<str>284587970000</str>
<str>Word-Count</str>
<str>120</str>
<str>Creation-Date</str>
<str>2000-03-15T16:57:27Z</str>
<str>stream_size</str>
<str>181248</str>
<str>Manager</str>
<str>
</str>
<str>stream_name</str>
<str>100000.ppt</str>
<str>Company</str>
<str>
</str>
<str>Keywords</str>
<str>test</str>
<str>Last-Save-Date</str>
<str>2009-03-24T16:52:26Z</str>
<str>Revision-Number</str>
<str>91</str>
<str>Last-Printed</str>
<str>2000-03-17T20:28:57Z</str>
<str>Comments</str>
<str>version 1.1</str>
<str>Template</str>
<str>
</str>
<str>custom:PaperSize</str>
<str>US</str>
<str>custom:DocID</str>
<str>
</str>
<str>xmpTPg:NPages</str>
<str>2</str>
<str>custom:NotesPageLayout</str>
<str>Lower</str>
<str>custom:DocIDinSlide</str>
<str>true</str>
<str>Category</str>
<str>POT - US</str>
<str>custom:Universal Objects</str>
<str>true</str>
<str>custom:Final</str>
<str>false</str>
<str>custom:DocIDinTitle</str>
<str>true</str>
<str>Content-Type</str>
<str>application/vnd.ms-powerpoint</str>
<str>custom:Title</str>
<str>test</str>
</arr>
<arr name="p">
<str>slide-content</str>
<str>slide-content</str>
</arr>
<arr name="revision_number">
<str>91</str>
</arr>
<arr name="slide_count">
<str>2</str>
</arr>
<arr name="stream_name">
<str>100000.ppt</str>
</arr>
<arr name="stream_size">
<str>181248</str>
</arr>
<arr name="stream_source_info">
<str>file:/C:/temp/test/100000.ppt</str>
</arr>
<arr name="template">
<str>
</str>
</arr>
<!-- Content field -->
<arr name="text">
<str>test Test test test test tes t</str>
</arr>
<arr name="title">
<str>test</str>
</arr>
<arr name="word_count">
<str>120</str>
</arr>
<arr name="xmptpg_npages">
<str>2</str>
</arr>
</doc>
I no longer have the problem I described above. Since asking the question, I have updated to Solr 4.0 alpha and recreated schema.xml from the Solr Cell example that ships with the 4.0a package. I suspect my original schema was copying the metadata fields' content to the text field, so it was most likely my own error.
In the solrconfig.xml, where the request handler is defined, add this line below
<str name="fmap.title">ignored_</str>
This tells Tika to simply ignore the title attribute (or which ever attributes you want ignored) it finds embedded within the PDF.
In my case, <str name="xpath">/xhtml:html/xhtml:body//node()</str> allowed extraction of content without the meta.
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">content</str>
<!-- Specify where content should be extracted exactly -->
<str name="xpath">/xhtml:html/xhtml:body//node()</str>
</lst>
</requestHandler>
I need to customize Solr highlighting prefix and suffix like this:
<span class="highlight">text</span>
instead of the default
<em>text</em>
That's why I'm using this configuration within the solrconfig.xml for the HighlightComponent:
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
<fragmentsBuilder name="simple" default="true" class="solr.highlight.SimpleFragmentsBuilder">
<lst name="defaults">
<str name="hl.tag.pre"><![CDATA[<span class="highlight">]]></str>
<str name="hl.tag.post"><![CDATA[</span>]]></str>
</lst>
</fragmentsBuilder>
</highlighting>
</searchComponent>
The following are the default parameters for my standard request handler:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="hl">true</str>
<str name="hl.fl">body,title</str>
<str name="hl.useFastVectorHighlighter">true</str>
</lst>
</requestHandler>
When I search for the text word I do get the text word highlighted, but not always using the prefix and suffix I configured:
<lst name="highlighting">
<lst name="document_1">
<arr name="body">
<str>my <em>text</em> highlighted</str>
</arr>
<arr name="title">
<str>my <span class="highlight">text</span> highlighted</str>
</arr>
</lst>
</lst>
Does anybody know why?
I am guessing you are seeing this behavior behavior because you only have the prefix and suffix defined for the SimpleFragmentsBuilder and the other highlights are coming from another fragment builder.
I am using a custom prefix and suffix for my highlighting and I set this value in the formatter section of the highlighting section of the solrconfig.xml and have not had any issues as it will apply to all fragment builders.
So maybe try the following:
<highlighting>
<fragmentsBuilder name="simple" default="true"
class="solr.highlight.SimpleFragmentsBuilder"/>
<!-- Configure the standard formatter -->
<formatter name="html" class="org.apache.solr.highlight.HtmlFormatter"
default="true">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<span class="highlight">]]></str>
<str name="hl.simple.post"><![CDATA[</span>]]></str>
</lst>
</formatter>
</highlighting>
I finally found out why! I'm using fastVectorHighlighter to make highlighting faster.
At the beginning I was highlighting only the title field and everything worked fine.
When I added the body field to highlighting I forgot to enable termVectors=true.
Now that my body field looks like this
<field name="body" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
after a full reindex highlighting is working perfectly:
<lst name="highlighting">
<lst name="document_1">
<arr name="body">
<str>my <span class="highlight">text</span> highlighted</str>
</arr>
<arr name="title">
<str>my <span class="highlight">text</span> highlighted</str>
</arr>
</lst>
</lst>
Previously the body field highlighting did work, but without fastVectorHighlighter since the field didn't have the termVectors=true parameter. That's why I got body highlighted with default prefix and suffix. Since fastVectorHighlighter is a completely different highlighting method, the configuration is different as well.
To avoid this kind of mistakes, as long the users can choose what fields to highlight with the hl.fl parameter, I'd recommend to include also the configuration for the standard highlighting (formatter element, class solr.highlight.HtmlFormatter) like this:
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
<formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<span class="highlight">]]></str>
<str name="hl.simple.post"><![CDATA[</span>]]></str>
</lst>
</formatter>
<fragmentsBuilder name="simple" default="true" class="solr.highlight.SimpleFragmentsBuilder">
<lst name="defaults">
<str name="hl.tag.pre"><![CDATA[<span class="highlight">]]></str>
<str name="hl.tag.post"><![CDATA[</span>]]></str>
</lst>
</fragmentsBuilder>
</highlighting>
</searchComponent>
This way highlighting will work with the same prefix and suffix even for fields with termVectors disabled.
I'm indexing rich text documents into SOLR 3.4 using ExtractingRequestHandler and I'm having trouble getting it to behave like I want it to.
I would like to store creation date as a field to use for faceted search later and have defined the following in schema.xml:
<field name="creation_date" type="date" indexed="true" stored="true"/>
I index like this:
curl -s "http://localhost:8983/solr/update/extract?literal.id=myid&resource.name=myfile.xls&commit=true" -F myfile=#/path/to/myfile.xls
I get the dynamic field attr_creation_date (that other rules make sure), but I don't get it as creation_date. I have also unsuccessfully tried to use copyField like so:
<copyField source="attr_creation_date" dest="creation_date"/>
Yet another try was putting this in solrconfig.xml, but no luck:
<str name="fmap.Creation-Date">creation_date</str>
I'm pretty sure I'm missing something basic here. Any help is most appreciated!
Settings for ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="fmap.Last-Save-Date">last_save_date</str>
<str name="fmap.Creation-Date">creation_date</str>
<str name="fmap.Content-Type">content_type</str>
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
</lst>
</requestHandler>
My schema.xml file (lots of default stuff): https://gist.github.com/1358002
I am trying to build the spellcheck index with IndexBasedSpellChecker
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">text</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
And I want to specify the dynamic field "*_text" as the field option:
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true">
How it can be done?
Copy all the text fields to one field:
<copyField source="*_text" dest="textSpell" />
and then build spellcheck index from field "textSpell"
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">textSpell</str>
<str name="spellcheckIndexDir">./spellchecker</str>
</lst>
This will be helpful
Implementation of solr spellchecker and
spellCheckComponent