The DataImportHandler status shows that indexing completed and 10 documents were added, but no results show up when I search for a word that is part of an added document. If I search for *:* it displays all records.
Example of a CLOB record:
<?xml version="1.0" encoding="UTF-8" ?>
<message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation" id="002" message-type="create">
<content>
<dsp:row>
<dsp:channel>100</dsp:channel>
<dsp:role>115</dsp:role>
</dsp:row>
<body><![CDATA[ <ol><li>java</li></ol><li>ASP</li>]]>
</body></content></message>
data-config.xml
<document name="doc">
<entity name="MYCONTENT" transformer="ClobTransformer"
query="SELECT CID,XML FROM MYCONTENT">
<field column="CID" name="CID"/>
<field column="XML" clob="true" name="XML"/>
</entity>
</document>
schema.xml
<field name="CID" type="string" indexed="true" stored="true" required="true"/>
<field name="XML" type="string" indexed="true" stored="true" required="true"/>
<dynamicField name="*" type="ignored" />
<uniqueKey>CID</uniqueKey>
<defaultSearchField>XML</defaultSearchField>
solrconfig.xml
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/data-config.xml</str>
<str name="rows">10</str>
</lst>
</requestHandler>
I do not know why it is not showing results when I search for "Java" or "ASP". Any help is greatly appreciated.
thanks in advance
srini
You have two things to fix.
First, the "string" field type treats the entire field value as a single token. You need a text field type.
Second, Solr does not parse the XML in your CLOB; it indexes it as raw text, splitting tokens as specified by your choice of tokenizer for that field. For example, a whitespace tokenizer would treat <dsp:role>115</dsp:role> as a single token, so a search for "115" would not match.
For testing, I would try using the HTMLStripCharFilterFactory in that field definition before the tokenizer. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
That should strip a fair amount of the XML. If you want to process it a specific way, you will probably want to learn about XPathEntityProcessor, which can extract parts of the XML for indexing. See: http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
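As a rough sketch, such a field type might look like the following (the type name text_xml is made up here; adjust the tokenizer and filters to taste), with the XML field switched over to it:
<fieldType name="text_xml" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip HTML/XML markup before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lowercase so a search for "Java" matches "java" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="XML" type="text_xml" indexed="true" stored="true" required="true"/>
Remember to reload the core and run a full re-import after changing the schema.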
So the situation is as follows:
Solr has a dataimport directly on the database
I have a table project in a relationship with unit; a project can hold up to 5 units
IDs are automatically generated by the database, starting at 1
IDs are unique within each table but not across the database
Since Solr requires each document to have a unique ID, I created a field solrId which gets its values from solr.UUIDUpdateProcessorFactory.
However, the dataimport only fetches a few projects and no units whatsoever. Can someone point me in the right direction?
The relevant passages:
solrconfig.xml:
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">solrId</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
....
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">wiensued-data-config.xml</str>
<str name="update.chain">uuid</str>
</lst>
</requestHandler>
managed-schema:
<uniqueKey>solrId</uniqueKey>
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<!-- solrId is the real ID -->
<field name="solrId" type="uuid" multiValued="false" indexed="true" stored="true" />
<!-- the ID from the database -->
<field name="id" type="int" multiValued="false" indexed="true" stored="true"/>
The dataimporthandler is configured to index id (from the table) into either projectId or unitId
The stacktrace is:
org.apache.solr.common.SolrException: [doc=null] missing required field: solrId
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:265)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:107)
at org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:212)
at org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:185)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:259)
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:433)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1384)
at org.apache.solr.update.DirectUpdateHandler2.updateDocument(DirectUpdateHandler2.java:920)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:913)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:302)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:979)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1192)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:748)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:91)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:80)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:254)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:526)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:415)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:474)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:457)
at java.lang.Thread.run(Thread.java:748)
However, the solrId is provided as far as I can tell
Just fix this in your DIH config; it will be cleaner and easier.
Prepend a 'p' to the project id to create the Solr id, and supply that to Solr. Likewise with the units (prepend 'u'). You get the idea:
<entity name="project" pk="id" query="select concat('p', id) as solrid, ...
Of course the SQL depends on your DB.
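For illustration, a fuller sketch of the two entities (all columns besides id are hypothetical, and the concat syntax varies by database). Note that solrId would then need to be a plain string field in the schema rather than the uuid type, and the uuid update chain would no longer be required:
<document>
  <!-- 'p' prefix keeps project ids distinct from unit ids -->
  <entity name="project" pk="id"
          query="SELECT CONCAT('p', id) AS solrId, id, name FROM project"/>
  <!-- 'u' prefix for unit ids -->
  <entity name="unit" pk="id"
          query="SELECT CONCAT('u', id) AS solrId, id, project_id FROM unit"/>
</document>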
I've followed the examples listed in the documentation here: http://wiki.apache.org/solr/Deduplication and https://cwiki.apache.org/confluence/display/solr/De-Duplication
However, when analyzing the results, every signatureField value is returned like so:
0000000000000000
I can't seem to figure out why a unique signature isn't being generated.
Relevant config sections:
solrconfig.xml
<requestHandler name="/update"
class="solr.XmlUpdateRequestHandler">
<!-- See below for information on defining
updateRequestProcessorChains that can be used by name
on each Update Request
-->
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
...
<!-- Deduplication
An example dedup update processor that creates the "id" field
on the fly based on the hash code of some other fields. This
example has overwriteDupes set to false since we are using the
id field as the signatureField and Solr will maintain
uniqueness based on that anyway.
-->
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">signatureField</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
schema.xml
<fields>
<!-- Valid attributes for fields:
name: mandatory - the name for the field
type: mandatory - the name of a previously defined type from the
<types> section
indexed: true if this field should be indexed (searchable or sortable)
stored: true if this field should be retrievable
multiValued: true if this field may contain multiple values per document
omitNorms: (expert) set to true to omit the norms associated with
this field (this disables length normalization and index-time
boosting for the field, and saves some memory). Only full-text
fields or fields that need an index-time boost need norms.
Norms are omitted for primitive (non-analyzed) types by default.
termVectors: [false] set to true to store the term vector for a
given field.
When using MoreLikeThis, fields used for similarity should be
stored for best performance.
termPositions: Store position information with the term vector.
This will increase storage costs.
termOffsets: Store offset information with the term vector. This
will increase storage costs.
default: a value that should be used if no value is specified
when adding a document.
-->
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
... etc
I'm wondering if anyone can steer me in the right direction?
Given the following (single core) queries:
http://localhost/solr/a/select?indent=true&q=*:*&rows=100&start=0&wt=json
http://localhost/solr/b/select?indent=true&q=*:*&rows=100&start=0&wt=json
The first query returns "numFound":40000.
The second query returns "numFound":10000.
I tried putting these together by:
http://localhost/solr/a/select?indent=true&shards=localhost/solr/a,localhost/solr/b&q=*:*&rows=100&start=0&wt=json
Now I get "numFound":50000.
The only problem is that "a" has more fields than "b", so the multi-collection request only returns the values of "a".
Is it possible to query multiple collections with different fields? Or do they have to be the same? And how should I change my third url to get this result?
What you need is - what I call - a unification core. That core's schema will have no content of its own; it is only used as a sort of wrapper to unify the fields you want to display from both cores. In there you will need
a schema.xml that wraps up all the fields that you want to have in your unified result
a query handler that combines the two different cores for you
An important restriction beforehand, taken from the Solr wiki page about DistributedSearch:
Documents must have a unique key and the unique key must be stored (stored="true" in schema.xml). The unique key field must be unique across all shards. If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic.
As an example, I have shard-1 with the fields id, title, description and shard-2 with the fields id, title, abstractText. So I have these schemas:
schema of shard-1
<schema name="shard-1" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="description"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
schema of shard-2
<schema name="shard-2" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="abstractText"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
To unify these schemas I create a third schema that I call shard-unification, which contains all four fields.
<schema name="shard-unification" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="abstractText"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="description"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
Now I need to make use of this combined schema, so I create a query handler in the solrconfig.xml of the shard-unification core:
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="qf">id title description abstractText</str>
<str name="fl">*,score</str>
<str name="mm">100%</str>
</lst>
</requestHandler>
<queryParser name="edismax" class="org.apache.solr.search.ExtendedDismaxQParserPlugin" />
That's it. Now some index data is required in shard-1 and shard-2. To query for a unified result, just query shard-unification with the appropriate shards param:
http://localhost/solr/shard-unification/select?q=*:*&rows=100&start=0&wt=json&shards=localhost/solr/shard-1,localhost/solr/shard-2
This will return a result like:
{
"responseHeader":{
"status":0,
"QTime":10},
"response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
{
"id":1,
"title":"title 1",
"description":"description 1",
"score":1.0},
{
"id":2,
"title":"title 2",
"abstractText":"abstract 2",
"score":1.0}]
}}
Fetch the origin shard of a document
If you want to fetch the originating shard into each document, you just need to add [shard] to fl, either as a query parameter or within the request handler's defaults; see below. The brackets are mandatory and will also appear in the resulting response.
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="qf">id title description abstractText</str>
<str name="fl">*,score,[shard]</str>
<str name="mm">100%</str>
</lst>
</requestHandler>
<queryParser name="edismax" class="org.apache.solr.search.ExtendedDismaxQParserPlugin" />
Working Sample
If you want to see a running example, check out my solrsample project on GitHub and execute the ShardUnificationTest. I have also included the shard-fetching by now.
Shards should be used in Solr "when an index becomes too large to fit on a single system, or when a single query takes too long to execute", so the number and names of the fields should always be the same. This is specified in the document the previous quote also comes from:
http://wiki.apache.org/solr/DistributedSearch
If you leave your query as it is and give the two shards the same fields, this should just work as expected.
If you want more info about how shards work in SolrCloud, have a look at this document as well:
http://wiki.apache.org/solr/SolrCloud
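A minimal sketch of what that means in practice, assuming hypothetical field names declared identically in the schema.xml of both core a and core b:
<!-- identical field declarations in the schema.xml of core a and core b -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>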
I've made a schema for solr and I don't know the name of every field from the document I want to add, so I defined a dynamicField like this:
<dynamicField name="*" type="text_general" indexed="true" stored="true" />
Right now I'm testing, and I don't get an error when importing documents with fields that aren't explicitly defined, but when I try to query for *:something (anything other than "*") I don't get any results back.
My question is: how can I define a catch-all field, and is there a right way to do this? Or am I under the wrong impression that a query for *:something would search all fields of all documents for "something"?
The search keyword *:something cannot get anything from Solr, no matter what kind of field you are using, dynamicField or not.
If I understand your question correctly, you want a dynamicField to store all fields and want to query all fields later.
Here is my solution.
First, define a default_search field for searching:
<field name="default_search" type="text" indexed="true" stored="true" multiValued="true"/>
And then copy all fields into the default_search field.
<copyField source="*" dest="default_search" />
Finally, you can make a query for all fields like this:
http://host/core/select/?q=something
or
http://host/core/select/?q=default_search:something
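Note that the bare q=something form only works if default_search is configured as the default search field. A minimal sketch of that, assuming a standard search handler:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- field to search when the query does not name one -->
    <str name="df">default_search</str>
  </lst>
</requestHandler>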
AFAIK, *:something does not query all the fields. It looks for a field named *.
I get the below error when attempting to do a query for *:test
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">9</int>
<lst name="params">
<str name="wt">xml</str>
<str name="q">*:test</str>
</lst>
</lst>
<lst name="error">
<str name="msg">undefined field *</str>
<int name="code">400</int>
</lst>
</response>
You would need to define a catchall field using copyField in your schema.xml.
I would recommend not using a simple wildcard for dynamic fields. Instead, use something like this:
<dynamicField name="*_text" type="text_general" indexed="true" stored="true" />
and then have a catchall field
<field name="CatchAll" type="text_general" indexed="true" stored="true" multiValued="false" />
You can have a copyField defined as below to support queries such as q=something:
<copyField source="*_text" dest="CatchAll" />
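Putting it together: a hypothetical document field body_text matches the *_text dynamic field and is copied into CatchAll, so an explicit query against CatchAll will find it:
<add>
  <doc>
    <field name="id">1</field>
    <!-- body_text matches *_text and is copied to CatchAll -->
    <field name="body_text">something interesting</field>
  </doc>
</add>
http://host/core/select?q=CatchAll:something
(For q=something without a field name, CatchAll would also have to be set as the default search field.)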
I am trying to convince solr to perform a bulk import of a sqlite database.
I followed all the instructions from the Solr wiki.
I configured the DataImportHandler to open that database through JDBC successfully, and I can start the import with http://localhost:8080/solr/dataimport?command=full-import
but whatever I do, DIH doesn't add any documents, even though it seems to index the DB.
The result:
<str name="command">full-import</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">14</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-04-06 01:14:30</str>
<str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
<str name="Committed">2012-04-06 01:14:32</str>
<str name="Optimized">2012-04-06 01:14:32</str>
<str name="Total Documents Processed">0</str>
I am using the emp table in an Oracle DB.
data-config.xml
<dataConfig>
<dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//localhost:1521/ORCL" user="scott" password="tiger"/>
<document>
<entity name="emp" query="select EMPNO, ENAME from EMP">
<field column="EMPNO" name="empno" />
<field column="ENAME" name="ename" />
</entity>
</document>
</dataConfig>
schema.xml
<field name="empno" type="int" indexed="true" stored="true"/>
<field name="ename" type="string" indexed="true" stored="true"/>
So it seems to fetch the data, but not to store it in the index.
Any ideas why this happens?
EDIT 1
The log shows warning messages like:
WARNING: Error creating document : SolrInputDocument[{ename=ename(1.0)={SMITH}, empno=empno(1.0)={7369}}]
org.apache.solr.common.SolrException: [doc=null] missing required field: id
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:346)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:73)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:636)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Log entries like that followed, and this message shows up at the end of the log:
2012. 4. 6 12:12:25 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[(null), (null), (null), (null), (null), (null), (null), (null), ... (14 adds)],optimize=} 0 0
I thought "missing required field: id" was related to this configuration in schema.xml:
<uniqueKey>id</uniqueKey>
but after deleting it, I got this message:
HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in solr.xml
-------------------------------------------------------------
org.apache.solr.common.SolrException: QueryElevationComponent requires the schema to have a uniqueKeyField implemented using StrField
at org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:158)
at org.apache.solr.core.SolrResourceLoader.inform
Any advice?
Try:
<entity name="emp" query="select EMPNO, ENAME from EMP">
<field column="EMPNO" name="id" />
<field column="ENAME" name="ename" />
</entity>
in data-config.xml, and put back:
<uniqueKey>id</uniqueKey>
in schema.xml, keeping the field definition for id there as well.
Or you can simply replace:
<uniqueKey>id</uniqueKey>
with:
<uniqueKey>empno</uniqueKey>
(note that, given the QueryElevationComponent error you quoted, the uniqueKey field would then have to use a string type rather than int). Hope that will work.
You can also add an auto-incrementing id with a ScriptTransformer:
<dataConfig>
<script><![CDATA[
// running counter, shared across all rows of this import
var id = 1;
function GenerateId(row) {
  // toFixed() converts the number to a string like "1"
  row.put('id', (id++).toFixed());
  return row;
}
]]></script>
<dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//localhost:1521/ORCL" user="scott" password="tiger"/>
<document>
<entity name="emp" query="select EMPNO, ENAME from EMP" transformer="script:GenerateId">
<field column="EMPNO" name="empno" />
<field column="ENAME" name="ename" />
</entity>
</document>
</dataConfig>
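One caveat with this approach: the counter restarts at 1 on every import, so the generated ids are only stable for full imports into an empty index. If you need delta imports, deriving the id from a database column (for example EMPNO itself, as in the first suggestion) is safer.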