Even after indexing a MySQL table in Solr, I am not able to retrieve data when querying like
http://localhost:8983/solr/select/?q=slno:5
My data-config.xml file is:
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/lbs"
              user="user"
              password="password"/>
  <document name="lbs">
    <entity name="radar_places"
            query="select * from radar_places"
            deltaImportQuery="SELECT * FROM radar_places WHERE slno='${dataimporter.delta.slno}'"
            deltaQuery="SELECT slno FROM radar_places WHERE modified > '${dataimporter.last_index_time}'">
      <field column="slno" name="slno" />
      <field column="place_id" name="place_id" />
      <field column="name" name="name" />
      <field column="geo_rss_point" name="geo_rss_point" />
      <field column="url" name="url" />
      <field column="location_id" name="location_id" />
      <field column="time" name="time" />
    </entity>
  </document>
</dataConfig>
In the browser I triggered a full import with
http://localhost:8983/solr/dataimport?command=full-import
Later, when I checked the status of the command at http://localhost:8983/solr/dataimport/, I got this:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">1151</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2010-02-21 07:53:14</str>
    <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
    <str name="Committed">2010-02-21 07:53:24</str>
    <str name="Optimized">2010-02-21 07:53:24</str>
    <str name="Total Documents Processed">0</str>
    <str name="Total Documents Failed">1151</str>
    <str name="Time taken ">0:0:10.56</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
1) Does this have anything to do with <str name="Total Documents Failed">1151</str>?
I am not able to figure out what is going wrong.
Are you sure that the data import configuration matches your Solr document schema?
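For instance, here is a minimal sketch of what schema.xml would need to declare for the data-config above. The field types, and the choice of slno as the unique key, are assumptions for illustration, not the asker's actual schema:

```xml
<!-- hypothetical schema.xml fragment: every column mapped in data-config.xml
     needs a matching (or dynamic) field definition like these -->
<field name="slno" type="string" indexed="true" stored="true"/>
<field name="place_id" type="string" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="geo_rss_point" type="string" indexed="true" stored="true"/>
<field name="url" type="string" indexed="true" stored="true"/>
<field name="location_id" type="string" indexed="true" stored="true"/>
<field name="time" type="date" indexed="true" stored="true"/>
<uniqueKey>slno</uniqueKey>
```

A document fails at index time when a mapped column has no matching field, or when a value cannot be parsed into the field's type, which would explain all 1151 rows failing.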
This is a follow-up to this question. I have a list of cities on which I want to implement a spell-checker, and I have the priorities/weights of these cities. I tried implementing a Solr suggester with a FileDictionaryFactory as a base, using the following format:
<city-name> <TAB> <weight> <TAB> <other parameters like citycode,country>
I am passing the other attributes like citycode, country, etc. as a pipe-separated payload string.
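As a concrete illustration, a FileDictionaryFactory source file holds one term per line, with the columns separated by literal TAB characters. The city names, weights, and payload values below are made up:

```
cologne	100.0	CGN|DE
munich	80.0	MUC|DE
```

The third column is the payload string that is returned alongside each suggestion.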
Here's my solrconfig
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">FileDictionaryFactory</str>
    <str name="field">name</str>
    <str name="weightField">searchscore</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="storeDir">autosuggest_dict</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
and my schema
<field name="name" type="string" indexed="true" stored="true" multiValued="false" />
<field name="countrycode" type="string" indexed="true" stored="true" multiValued="false" />
<field name="latlng" type="location" indexed="true" stored="true" multiValued="false" />
<field name="searchfield" type="text_ngram" indexed="true" stored="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />
<uniqueKey>id</uniqueKey>
<defaultSearchField>searchfield</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<copyField source="name" dest="searchfield"/>
Now the problem I am facing is that I get 0 results for every search query, even though I can see the storeDir being created, and it contains a bin file whose data looks like my payload data.
This is the URL format I am using:
/suggest?suggest=true&suggest.dictionary=mySuggester&wt=json&suggest.q=cologne
So, I have the following questions:
What does the creation of storeDir signify? Does it mean the dictionary was built successfully?
If yes, then what's wrong with my query? If no, am I missing something here (indexPath?)?
Is this the right way to supply search parameters via the payload field? If not, is there any other way?
There is a slight change needed in your solrconfig.xml: you need to remove buildOnStartup from the suggester configuration, or set it to true.
[solrconfig.xml]
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">FileDictionaryFactory</str>
    <str name="field">name</str>
    <str name="weightField">searchscore</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">true</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="storeDir">autosuggest_dict</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
There is a problem with the file-based suggester: it will not build its suggestions through a query just by setting suggest=true. You need to build a file-based suggester on startup.
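Alternatively, the SuggestComponent accepts an explicit suggest.build=true request parameter, so you can leave buildOnStartup off and build the dictionary once on demand (using the handler and dictionary names from the config above):

```
/suggest?suggest.dictionary=mySuggester&suggest.build=true
```

Subsequent suggest.q requests are then served from the built dictionary.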
I was using searchfield as defaultSearchField in schema, but had configured name as suggest field. The moment I changed field to searchfield and suggestAnalyzerFieldType to text_ngram, it started working.
Here is the working solrconfig:
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">suggestions</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">FileDictionaryFactory</str>
    <str name="field">searchfield</str>
    <str name="weightField">searchscore</str>
    <str name="suggestAnalyzerFieldType">text_ngram</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="storeDir">autosuggest_dict</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">suggestions</str>
    <str name="suggest.dictionary">results</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
I do a full-import with the command below and get the response shown further down.
http://localhost:8983/solr/karan/dataimport?command=full-import&commit=true&clean=false
But when I run the following snippet
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class SolrJSearcher {
    public static void main(String[] args) throws SolrServerException, IOException {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/karan");
        SolrQuery query = new SolrQuery();
        query.set("q", "karan");
        QueryResponse response = solr.query(query);
        SolrDocumentList results = response.getResults();
        for (int i = 0; i < results.size(); ++i) {
            System.out.println(results.get(i));
        }
    }
}
I get no results even though two rows are there. If I change q to *:* I get the two results, without searching on the name karan. Can you please clarify what is going wrong here? If I try the same changes in the techproducts sample project, I get the results as expected.
Solrconfig.xml
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
data-config.xml
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/greed"
              user="root"
              password="kalkoti"/>
  <document>
    <entity name="id"
            query="select id,name from testing">
    </entity>
  </document>
</dataConfig>
I have created the collection karan using
solr create -c karan
Response from full-import
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">12</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">2</str>
    <str name="Total Documents Processed">2</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2015-07-06 13:55:26</str>
    <str name="">Indexing completed. Added/Updated: 2 documents. Deleted 0 documents.</str>
    <str name="Committed">2015-07-06 13:55:26</str>
    <str name="Time taken">0:0:0.431</str>
  </lst>
</response>
You aren't specifying what fields should be written to in your data-config.xml file. See https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
You haven't given us your full schema.xml or solrconfig.xml files, so it is hard to tell you exactly what to do, but Solr uses the select request handler as default, which uses the text field as the default search field. Meaning you will need to map whatever database column contains the term karan to the text field.
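For example, here is a hedged sketch of what the explicit mapping in data-config.xml could look like. Mapping the name column into a text field is an assumption based on the default search field described above, not a confirmed fix for this schema:

```xml
<entity name="id" query="select id,name from testing">
  <!-- explicit column-to-field mappings (assumed schema field names) -->
  <field column="id" name="id"/>
  <field column="name" name="text"/>
</entity>
```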
I have Solr 3.6.2 installed on my machine, running perfectly with Tomcat. I want to index a Wikipedia dump file using Solr. How do I do this using DataImportHandler? Is there any other way? I don't have any knowledge of XML.
The file I have mentioned is around 45 GB when extracted.
Any help would be greatly appreciated.
Update:
I tried doing what is described on the DataImportHandler page, but I get an error, maybe because their version of Solr is much older.
My data-config.xml:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="./data/enwiki.xml"
            transformer="RegexTransformer,DateFormatTransformer">
      <field column="id" xpath="/mediawiki/page/id" />
      <field column="title" xpath="/mediawiki/page/title" />
      <field column="revision" xpath="/mediawiki/page/revision/id" />
      <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
      <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
      <field column="text" xpath="/mediawiki/page/revision/text" />
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
      <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
    </entity>
  </document>
</dataConfig>
Schema: I just added the parts they have given on the website to my schema.xml file.
The error response I am getting is:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">solr-data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:0:1.381</str>
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Processed">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="">Indexing failed. Rolled back all changes.</str>
    <str name="Rolledback">2013-05-17 16:48:32</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
A simple post is not the right way to index Wikipedia. You need to look into using the DataImportHandler instead; DIH supports streaming import.
Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text content of the document.
I would like to provide some snippet highlighting of the content and the subject metadata within the content field is skewing the highlight results.
UPDATE: Screenshot of Tika output as indexed by Solr. The highlighted portion is the block of metadata that gets prepended as a block of text to the PDF content.
The ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
Schema.xml fields. Note "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when a doc is posted to the handler.
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
As all other answers are completely irrelevant, I'll post mine:
I have experienced exactly the same problem the OP describes (Solr 4.3.0, custom config, custom schema, etc.; I'm not a newbie and understand Solr internals pretty well).
This was my ERH config:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">false</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
It was basically configured to ignore everything except the content (I believe that's reasonable for many people).
After careful investigation I found out that
<str name="captureAttr">false</str>
was the thing causing the OP's issue. By default it is turned on, but I had turned it off as I did not need it anyway. That was my mistake. I have no idea why, but it causes Solr to put the extracted attributes into the fmap.content field together with the extracted text.
So the solution is to turn it back on.
Final ERH:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">true</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
Now only the extracted text is put into the fmap.content field.
Unfortunately I have not found any piece of documentation that explains this. It is either a bug or just strange behavior.
Tika with Solr produces different fields for the content and the metadata.
If you use the standard ExtractingRequestHandler:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
The fmap.content mapping sends the main content to the text field, which should contain only the content of your PDF.
The other metadata fields can easily be inspected by modifying the schema.xml.
Mark stored="true" for the ignored field type:
<fieldtype name="ignored" stored="true" indexed="false" multiValued="true" class="solr.StrField" />
Capture all fields:
<dynamicField name="*" type="ignored" multiValued="true" />
Tika adds a lot of fields for the metadata, with the content set separately, e.g. this response when a ppt is fed to the extract handler:
<doc>
<arr name="application_name">
<str>Microsoft PowerPoint</str>
</arr>
<str name="category">POT - US</str>
<str name="comments">version 1.1</str>
<arr name="company">
<str>
</str>
</arr>
<arr name="content_type">
<str>application/vnd.ms-powerpoint</str>
</arr>
<arr name="creation_date">
<str>2000-03-15T16:57:27Z</str>
</arr>
<arr name="custom_delivery_date">
<str>
</str>
</arr>
<arr name="custom_docid">
<str>
</str>
</arr>
<arr name="custom_docidinslide">
<str>true</str>
</arr>
<arr name="custom_docidintitle">
<str>true</str>
</arr>
<arr name="custom_docidposition">
<str>0</str>
</arr>
<arr name="custom_event">
<str>
</str>
</arr>
<arr name="custom_final">
<str>false</str>
</arr>
<arr name="custom_mckpapersize">
<str>US</str>
</arr>
<arr name="custom_notespagelayout">
<str>Lower</str>
</arr>
<arr name="custom_title">
<str>Lower Universal Template US</str>
</arr>
<arr name="custom_universal_objects">
<str>true</str>
</arr>
<arr name="edit_time">
<str>284587970000</str>
</arr>
<str name="id">101</str>
<arr name="ignored_">
<str>slideShow</str>
<str>slide</str>
<str>slide</str>
<str>slideNotes</str>
</arr>
<str name="keywords">test</str>
<arr name="last_author">
<str>Corporate</str>
</arr>
<arr name="last_printed">
<str>2000-03-17T20:28:57Z</str>
</arr>
<arr name="last_save_date">
<str>2009-03-24T16:52:26Z</str>
</arr>
<arr name="manager">
<str>
</str>
</arr>
<arr name="meta">
<str>stream_source_info</str>
<str>file:/C:/temp/nuggets/100000.ppt</str>
<str>Last-Author</str>
<str>Corporate</str>
<str>Slide-Count</str>
<str>2</str>
<str>custom:DocIDPosition</str>
<str>0</str>
<str>Application-Name</str>
<str>Microsoft PowerPoint</str>
<str>custom:Delivery Date</str>
<str>
</str>
<str>custom:Event</str>
<str>
</str>
<str>Edit-Time</str>
<str>284587970000</str>
<str>Word-Count</str>
<str>120</str>
<str>Creation-Date</str>
<str>2000-03-15T16:57:27Z</str>
<str>stream_size</str>
<str>181248</str>
<str>Manager</str>
<str>
</str>
<str>stream_name</str>
<str>100000.ppt</str>
<str>Company</str>
<str>
</str>
<str>Keywords</str>
<str>test</str>
<str>Last-Save-Date</str>
<str>2009-03-24T16:52:26Z</str>
<str>Revision-Number</str>
<str>91</str>
<str>Last-Printed</str>
<str>2000-03-17T20:28:57Z</str>
<str>Comments</str>
<str>version 1.1</str>
<str>Template</str>
<str>
</str>
<str>custom:PaperSize</str>
<str>US</str>
<str>custom:DocID</str>
<str>
</str>
<str>xmpTPg:NPages</str>
<str>2</str>
<str>custom:NotesPageLayout</str>
<str>Lower</str>
<str>custom:DocIDinSlide</str>
<str>true</str>
<str>Category</str>
<str>POT - US</str>
<str>custom:Universal Objects</str>
<str>true</str>
<str>custom:Final</str>
<str>false</str>
<str>custom:DocIDinTitle</str>
<str>true</str>
<str>Content-Type</str>
<str>application/vnd.ms-powerpoint</str>
<str>custom:Title</str>
<str>test</str>
</arr>
<arr name="p">
<str>slide-content</str>
<str>slide-content</str>
</arr>
<arr name="revision_number">
<str>91</str>
</arr>
<arr name="slide_count">
<str>2</str>
</arr>
<arr name="stream_name">
<str>100000.ppt</str>
</arr>
<arr name="stream_size">
<str>181248</str>
</arr>
<arr name="stream_source_info">
<str>file:/C:/temp/test/100000.ppt</str>
</arr>
<arr name="template">
<str>
</str>
</arr>
<!-- Content field -->
<arr name="text">
<str>test Test test test test tes t</str>
</arr>
<arr name="title">
<str>test</str>
</arr>
<arr name="word_count">
<str>120</str>
</arr>
<arr name="xmptpg_npages">
<str>2</str>
</arr>
</doc>
I no longer have the problem I described above. Since asking the question, I have updated to Solr 4.0 alpha and recreated schema.xml from the Solr Cell example that ships with the 4.0a package. I suspect my original schema was copying the metadata fields' content to the text field, so it was most likely my own error.
In solrconfig.xml, where the request handler is defined, add the line below:
<str name="fmap.title">ignored_</str>
This tells Tika to simply ignore the title attribute (or whichever attributes you want ignored) that it finds embedded within the PDF.
In my case, <str name="xpath">/xhtml:html/xhtml:body//node()</str> allowed extraction of content without the meta.
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">content</str>
    <!-- Specify where content should be extracted exactly -->
    <str name="xpath">/xhtml:html/xhtml:body//node()</str>
  </lst>
</requestHandler>
I am trying to build the spellcheck index with IndexBasedSpellChecker:
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="spellcheckIndexDir">./spellchecker</str>
</lst>
And I want to specify the dynamic field "*_text" as the field option:
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
How can this be done?
Copy all the text fields to one field:
<copyField source="*_text" dest="textSpell" />
and then build spellcheck index from field "textSpell"
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">textSpell</str>
  <str name="spellcheckIndexDir">./spellchecker</str>
</lst>
These will be helpful:
Implementation of Solr spellchecker and
SpellCheckComponent