Indexing wikipedia dump with solr

I have solr 3.6.2 installed on my machine, running perfectly with tomcat. I want to index a wikipedia dump file using solr. How do I do this using DataImportHandler? Is there any other way? I don't have any knowledge of xml.
The file I mentioned is around 45 GB when extracted.
Any help would be greatly appreciated.
Update:
I tried doing what's said on the DataImportHandler wiki page, but I get an error, maybe because the version of solr used there is much older.
My data.config:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="./data/enwiki.xml"
            transformer="RegexTransformer,DateFormatTransformer"
            >
      <field column="id"        xpath="/mediawiki/page/id" />
      <field column="title"     xpath="/mediawiki/page/title" />
      <field column="revision"  xpath="/mediawiki/page/revision/id" />
      <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
      <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
      <field column="text"      xpath="/mediawiki/page/revision/text" />
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
      <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
    </entity>
  </document>
</dataConfig>
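For reference, DIH itself has to be registered in solrconfig.xml and pointed at this config file. A minimal sketch (the status response below reports the configured filename as solr-data-config.xml, so that name is assumed here):
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">solr-data-config.xml</str>
  </lst>
</requestHandler>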
Schema: I just added the parts they have given on the website to my schema.xml file.
The error I am getting is:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">solr-data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Time Elapsed">0:0:1.381</str>
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">0</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="">Indexing failed. Rolled back all changes.</str>
<str name="Rolledback">2013-05-17 16:48:32</str>
</lst>
<str name="WARNING">
This response format is experimental. It is likely to change in the future.
</str>
</response>
Please help.

The simple post tool is not the right way to index Wikipedia. You need to look into using DataImportHandler instead; DIH supports streaming imports.
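For example, once the handler is registered, a full import is started and monitored over plain HTTP (the /dataimport path and Tomcat port 8080 are assumptions matching the question's setup):
http://localhost:8080/solr/dataimport?command=full-import
http://localhost:8080/solr/dataimport?command=status
The status URL can be polled while the 45 GB dump is streaming in; Total Rows Fetched should climb steadily if the XPaths match.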

Related

integrating solr autosuggest functionality error

I am trying to integrate the auto-suggest functionality of solr in my project. I used this as my starting point and changed the searched fields accordingly.
My schema.xml:
<field name="name" type="text_suggest" indexed="true" stored="true"/>
<field name="manu" type="text_suggest" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
<!-- A variant of textsuggest which only matches from the very left edge -->
<copyField source="name" dest="textnge"/>
<field name="textnge" type="autocomplete_edge" indexed="true" stored="false" />
<!-- A variant of name which matches from the left edge of all terms (implicit truncation) -->
<copyField source="name" dest="textng"/>
<field name="textng" type="autocomplete_ngram" indexed="true" stored="false" omitNorms="true" omitTermFreqAndPositions="true" />
My request handler in solrconfig.xml
<requestHandler class="solr.SearchHandler" name="/ac" default="true" >
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="qf">name^50 manu^20.0 textng^50.0</str>
    <str name="pf">textnge^50.0</str>
    <str name="bf">product(log(sum(popularity,1)),100)^20</str>
    <str name="debugQuery">false</str>
  </lst>
</requestHandler>
The problem is that my "/ac" handler is acting more like the "/select" handler. When I type "moni" I get nothing, but when I type "monitor" it returns the documents containing monitor.
I have been trying this for a whole day and nothing seems to work. Any help will be deeply appreciated.
Well, when you look for "moni" in your query, you are specifically saying that you're looking for the "moni" keyword. Try matching partial terms by adding "*", such as q=moni*.
You can also query the other analyzed field types directly, like autocomplete_edge (q=textnge:moni) or autocomplete_ngram (q=textng:moni), to get more data.
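For example, against the /ac handler above (host and port are assumptions):
http://localhost:8983/solr/ac?q=moni*
http://localhost:8983/solr/ac?q=textnge:moni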
I think you need to specify a search component in solrconfig.xml like below:
<searchComponent class="solr.SpellCheckComponent" name="ac">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookupFactory</str>
    <str name="field">yourfieldname</str> <!-- the indexed field to derive suggestions from -->
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
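Note that a searchComponent is only invoked through a request handler that lists it, so you also need something along the lines of this sketch (the handler name /suggest is an assumption; "ac" and "suggest" mirror the component and spellchecker names above):
<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>ac</str>
  </arr>
</requestHandler>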

textual content without metadata from Tika via SolrCell

Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text content of the document.
I would like to provide some snippet highlighting of the content and the subject metadata within the content field is skewing the highlight results.
UPDATE: Screenshot of Tika output as indexed by Solr. Highlighted portion is the block of metadata that gets prepended as a block of text to the PDF content.
The ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
Schema.xml fields. Note "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when a doc is posted to the handler.
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
As all other answers are completely irrelevant, I'll post mine:
I have experienced exactly the same problem the OP describes (Solr 4.3.0, custom config, custom schema, etc.; I'm not a newbie and understand Solr internals pretty well).
This was my ERH config:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">false</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
It was basically configured to ignore everything except the content (I believe that's reasonable for many people).
After careful investigation I found out that
<str name="captureAttr">false</str>
was the thing that caused the OP's issue. By default it is turned on, but I turned it off as I did not need it anyway. That was my mistake. I have no idea why, but it causes Solr to put the extracted attributes into the fmap.content field together with the extracted text.
So the solution is to turn it back on.
Final ERH:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">true</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
Now only the extracted text is put into the fmap.content field.
Unfortunately I have not found any documentation explaining this. It is either a bug or just strange behavior.
Tika with Solr produces separate fields for the content and the metadata.
If you use the standard ExtractingRequestHandler:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
The fmap.content mapping sends the main extracted content to the text field, which should hold only the content of your pdf.
The other metadata fields can easily be inspected by modifying schema.xml:
mark stored="true" for the ignored field type
<fieldtype name="ignored" stored="true" indexed="false" multiValued="true" class="solr.StrField" />
Capture all fields:
<dynamicField name="*" type="ignored" multiValued="true" />
Tika adds a lot of fields for the metadata, with the content set separately; e.g., this is the response when a ppt is fed to the extract handler:
<doc>
<arr name="application_name">
<str>Microsoft PowerPoint</str>
</arr>
<str name="category">POT - US</str>
<str name="comments">version 1.1</str>
<arr name="company">
<str>
</str>
</arr>
<arr name="content_type">
<str>application/vnd.ms-powerpoint</str>
</arr>
<arr name="creation_date">
<str>2000-03-15T16:57:27Z</str>
</arr>
<arr name="custom_delivery_date">
<str>
</str>
</arr>
<arr name="custom_docid">
<str>
</str>
</arr>
<arr name="custom_docidinslide">
<str>true</str>
</arr>
<arr name="custom_docidintitle">
<str>true</str>
</arr>
<arr name="custom_docidposition">
<str>0</str>
</arr>
<arr name="custom_event">
<str>
</str>
</arr>
<arr name="custom_final">
<str>false</str>
</arr>
<arr name="custom_mckpapersize">
<str>US</str>
</arr>
<arr name="custom_notespagelayout">
<str>Lower</str>
</arr>
<arr name="custom_title">
<str>Lower Universal Template US</str>
</arr>
<arr name="custom_universal_objects">
<str>true</str>
</arr>
<arr name="edit_time">
<str>284587970000</str>
</arr>
<str name="id">101</str>
<arr name="ignored_">
<str>slideShow</str>
<str>slide</str>
<str>slide</str>
<str>slideNotes</str>
</arr>
<str name="keywords">test</str>
<arr name="last_author">
<str>Corporate</str>
</arr>
<arr name="last_printed">
<str>2000-03-17T20:28:57Z</str>
</arr>
<arr name="last_save_date">
<str>2009-03-24T16:52:26Z</str>
</arr>
<arr name="manager">
<str>
</str>
</arr>
<arr name="meta">
<str>stream_source_info</str>
<str>file:/C:/temp/nuggets/100000.ppt</str>
<str>Last-Author</str>
<str>Corporate</str>
<str>Slide-Count</str>
<str>2</str>
<str>custom:DocIDPosition</str>
<str>0</str>
<str>Application-Name</str>
<str>Microsoft PowerPoint</str>
<str>custom:Delivery Date</str>
<str>
</str>
<str>custom:Event</str>
<str>
</str>
<str>Edit-Time</str>
<str>284587970000</str>
<str>Word-Count</str>
<str>120</str>
<str>Creation-Date</str>
<str>2000-03-15T16:57:27Z</str>
<str>stream_size</str>
<str>181248</str>
<str>Manager</str>
<str>
</str>
<str>stream_name</str>
<str>100000.ppt</str>
<str>Company</str>
<str>
</str>
<str>Keywords</str>
<str>test</str>
<str>Last-Save-Date</str>
<str>2009-03-24T16:52:26Z</str>
<str>Revision-Number</str>
<str>91</str>
<str>Last-Printed</str>
<str>2000-03-17T20:28:57Z</str>
<str>Comments</str>
<str>version 1.1</str>
<str>Template</str>
<str>
</str>
<str>custom:PaperSize</str>
<str>US</str>
<str>custom:DocID</str>
<str>
</str>
<str>xmpTPg:NPages</str>
<str>2</str>
<str>custom:NotesPageLayout</str>
<str>Lower</str>
<str>custom:DocIDinSlide</str>
<str>true</str>
<str>Category</str>
<str>POT - US</str>
<str>custom:Universal Objects</str>
<str>true</str>
<str>custom:Final</str>
<str>false</str>
<str>custom:DocIDinTitle</str>
<str>true</str>
<str>Content-Type</str>
<str>application/vnd.ms-powerpoint</str>
<str>custom:Title</str>
<str>test</str>
</arr>
<arr name="p">
<str>slide-content</str>
<str>slide-content</str>
</arr>
<arr name="revision_number">
<str>91</str>
</arr>
<arr name="slide_count">
<str>2</str>
</arr>
<arr name="stream_name">
<str>100000.ppt</str>
</arr>
<arr name="stream_size">
<str>181248</str>
</arr>
<arr name="stream_source_info">
<str>file:/C:/temp/test/100000.ppt</str>
</arr>
<arr name="template">
<str>
</str>
</arr>
<!-- Content field -->
<arr name="text">
<str>test Test test test test tes t</str>
</arr>
<arr name="title">
<str>test</str>
</arr>
<arr name="word_count">
<str>120</str>
</arr>
<arr name="xmptpg_npages">
<str>2</str>
</arr>
</doc>
I no longer have the problem I described above. Since asking the question, I have updated to Solr 4.0 alpha and recreated schema.xml from the Solr Cell example that ships with the 4.0a package. I suspect my original schema was copying the metadata fields' content to the text field, so it was most likely my own error.
In solrconfig.xml, where the request handler is defined, add the line below:
<str name="fmap.title">ignored_</str>
This tells Tika to simply ignore the title attribute (or whichever attributes you want ignored) that it finds embedded within the PDF.
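In context, the line sits among the handler's defaults; a minimal sketch, assuming the stock handler from the earlier answer:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="fmap.title">ignored_</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>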
In my case, <str name="xpath">/xhtml:html/xhtml:body//node()</str> allowed extraction of content without the meta.
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">content</str>
    <!-- Specify where content should be extracted exactly -->
    <str name="xpath">/xhtml:html/xhtml:body//node()</str>
  </lst>
</requestHandler>

Basic UIMA with SOLR

I am trying to connect UIMA with Solr. I have downloaded the Solr 3.5 dist and have it successfully running with nutch and tika on windows 7, using solrcell and curl via cygwin.
To begin, I copied the 6 jars from solr/contrib/uima/lib to the working /lib in solr.
Next, I read the readme.txt file in solr/contrib/uima/lib and edited both my solrconfig.xml and schema.xml, to no avail.
I then found this link, which seemed a bit more applicable since I didn't care to use Alchemy or OpenCalais: http://code.google.com/a/apache-extras.org/p/rondhuit-uima/?redir=1
Still, when I run a curl command that imports a pdf via solrcell, I do not get the additional UIMA fields, nor do I get anything in my logs. The test.pdf is parsed though, and I see the pdf in Solr using:
curl 'http://localhost:8080/solr/update/extract?fmap.content=content&literal.id=doc1&commit=true' -F "file=@test.pdf"
solrconfig.xml:
<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters">
        <str name="host">http://localhost</str>
        <str name="port">8080</str>
      </lst>
      <str name="analysisEngine">C:\uima\desc\com\rondhuit\uima\desc\NextAnnotatorDescriptor.xml</str>
      <bool name="ignoreErrors">true</bool>
      <str name="logField">id</str>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>content</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">com.rondhuit.uima.next.NamedEntity</str>
          <lst name="mapping">
            <str name="feature">entity</str>
            <str name="fieldNameFeature">uname</str>
            <str name="dynamicField">*_sm</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update/uima" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uima</str>
  </lst>
</requestHandler>
I also adjusted my /update requestHandler:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">uima</str>
  </lst>
</requestHandler>
schema.xml:
<!-- fields for UIMA -->
<field name="uname" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
<dynamicField name="*_sm" type="string" indexed="true" stored="true"/>
All I am trying to do is have UIMA pull names out of text (just as a demo to start) and I cannot figure out what I am doing wrong.
Thank you in advance for reading this.
Not sure if this ever got addressed, but in case someone else is looking: I had this same problem yesterday. I figured out that I was calling /update/extract to use solrcell, and that handler doesn't run the uima chain because the chain is only wired into /update.
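If you want the UIMA fields when posting through solrcell, one option (a sketch, untested) is to attach the chain to the extract handler as well; note that update.chain is the newer name for what older configs call update.processor:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uima</str>
    <str name="fmap.content">content</str>
  </lst>
</requestHandler>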

Solr spellcheck configuration

I am trying to build the spellcheck index with IndexBasedSpellChecker:
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="spellcheckIndexDir">./spellchecker</str>
</lst>
And I want to specify the dynamic field "*_text" as the field option:
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
How can it be done?
Copy all the text fields to one field:
<copyField source="*_text" dest="textSpell" />
and then build the spellcheck index from the field "textSpell":
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">textSpell</str>
  <str name="spellcheckIndexDir">./spellchecker</str>
</lst>
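Note that the destination field itself must also be declared in schema.xml; a minimal sketch, reusing the "text" type from the dynamic field above:
<field name="textSpell" type="text" indexed="true" stored="false" multiValued="true"/>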
These will be helpful: Implementation of solr spellchecker, and SpellCheckComponent.

retrieve data after indexing a mysql table

Even after indexing a mysql table, I am not able to retrieve data from solr when querying like:
http://localhost:8983/solr/select/?q=slno:5
My data-config.xml file is:
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/lbs"
              user="user"
              password="password"/>
  <document name="lbs">
    <entity name="radar_places"
            query="select * from radar_places"
            deltaImportQuery="SELECT * FROM radar_places WHERE slno='${dataimporter.delta.slno}'"
            deltaQuery="SELECT slno FROM radar_places WHERE modified > '${dataimporter.last_index_time}'" >
      <field column="slno" name="slno" />
      <field column="place_id" name="place_id" />
      <field column="name" name="name" />
      <field column="geo_rss_point" name="geo_rss_point" />
      <field column="url" name="url" />
      <field column="location_id" name="location_id" />
      <field column="time" name="time" />
    </entity>
  </document>
</dataConfig>
In the browser I had used
http://localhost:8983/solr/dataimport?command=full-import
Later, when I checked the status of the command at http://localhost:8983/solr/dataimport/,
I got this:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">1151</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2010-02-21 07:53:14</str>
<str name="">
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
</str>
<str name="Committed">2010-02-21 07:53:24</str>
<str name="Optimized">2010-02-21 07:53:24</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Failed">1151</str>
<str name="Time taken ">0:0:10.56</str>
</lst>
<str name="WARNING">
This response format is experimental. It is likely to change in the future.
</str>
</response>
1) Does this have anything to do with <str name="Total Documents Failed">1151</str>?
I am not able to figure out what is going wrong.
Are you sure that the data import configuration matches your Solr document schema? Every row was fetched (1151), yet every document failed, which points to a schema mismatch rather than a database problem.
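For instance, all 1151 fetched rows became failed documents, which usually means a required field (typically the uniqueKey) is missing or mismatched. Every column mapped in data-config.xml needs a corresponding field in schema.xml, along the lines of this sketch (the field types here are assumptions):
<field name="slno" type="string" indexed="true" stored="true" required="true"/>
<field name="place_id" type="string" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="geo_rss_point" type="string" indexed="true" stored="true"/>
<field name="url" type="string" indexed="true" stored="true"/>
<field name="location_id" type="string" indexed="true" stored="true"/>
<field name="time" type="date" indexed="true" stored="true"/>

<uniqueKey>slno</uniqueKey>
Check the Solr log (e.g. Tomcat's catalina.out or the Jetty console) for the per-document exception; it names the offending field.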
