Many-to-one mapping within Apache Solr

I am using Solr to index my database of reports. Reports can have text, submitter information, etc. This currently works and looks like this:
"docs": [
{
"Text": "Some Report Text",
"ReportId": "1",
"Date": "2013-08-09T14:59:28.147Z",
"SubmitterId": "11111",
"FirstName": "John",
"LastName": "Doe",
"_version_": 1444554112206110700
}
]
The other thing a report can have is viewers (a one-to-many relationship between a single report and its viewers). I want to be able to capture those viewers like this in my JSON output:
"docs": [
{
"Text": "Some Report Text",
"ReportId": "1",
"Date": "2013-08-09T14:59:28.147Z",
"SubmitterId": "11111",
"FirstName": "John",
"LastName": "Doe",
"Viewers": [
{ "ViewerId": "22222" },
{ "ViewerId": "33333" }
],
"_version_": 1444554112206110700
}
]
I cannot seem to get that to happen, however. Here is my data-config.xml (parts removed that aren't necessary to the question):
<entity name="Report" query="select * from Reports">
<field column="Text" />
<field column="ReportId" />
<!-- Get Submitter Information as another entity. -->
<entity name="Viewers" query="select * from ReportViewers where Id='${Report.ReportId}'">
<field column="Id" name="ViewerId" />
</entity>
</entity>
And the schema.xml:
<field name="Text" type="text_en" indexed="true" stored="true" />
<field name="ReportId" type="string" indexed="true" stored="true" />
<field name="Viewers" type="string" indexed="true" stored="true" multiValued="true" />
<field name="ViewerId" type="string" indexed="true" stored="true" />
When I do the data import, I just don't see anything. No errors, nothing apparently wrong, but I'm pretty sure my data-config and/or my schema are not correct. What am I doing wrong?

Unfortunately Solr does not allow nesting (see http://lucene.472066.n3.nabble.com/Possible-to-have-Solr-documents-with-deeply-nested-data-structures-i-e-hashes-within-hashes-td4004285.html). You need to flatten your data!
So
"Viewers": [
{ "ViewerId": "22222" },
{ "ViewerId": "33333" }
]
is not possible. Instead flatten it and have a ViewerIds array:
"ViewerIds": ["22222", "33333" ]
In your schema, you will have:
<field name="ViewerIds" type="string" indexed="true" stored="true" multiValued="true" />
Then modify your data-config accordingly, so that the viewer Id column maps to the ViewerIds field.
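As a rough illustration of what the flattening amounts to, here is a minimal Python sketch that collapses report rows and viewer rows into one flat document per report with a multi-valued ViewerIds list. The row layouts and the function name flatten_reports are assumptions made for this example; they are not part of Solr or the DataImportHandler.

```python
def flatten_reports(report_rows, viewer_rows):
    """Build one flat document per report with a multi-valued ViewerIds list.

    report_rows: iterable of dicts with at least a "ReportId" key (assumed layout).
    viewer_rows: iterable of (report_id, viewer_id) pairs (assumed layout).
    """
    docs = {}
    for row in report_rows:
        doc = dict(row)          # copy so the input rows are not mutated
        doc["ViewerIds"] = []    # flat, multi-valued field instead of nested objects
        docs[row["ReportId"]] = doc
    for report_id, viewer_id in viewer_rows:
        if report_id in docs:    # ignore viewers of unknown reports
            docs[report_id]["ViewerIds"].append(viewer_id)
    return list(docs.values())


reports = [{"ReportId": "1", "Text": "Some Report Text"}]
viewers = [("1", "22222"), ("1", "33333")]
flat_docs = flatten_reports(reports, viewers)
```

Each resulting document carries only scalar fields plus a list of strings, which is the shape a multiValued string field can hold.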

Related

Importing Nested Documents in Solr using DataImportHandler

I am working on a project where the specification requires a parent-child relationship within the Solr data collection, i.e. a user and the collection of languages they speak (each of which is made up of multiple data fields). My production system is a 4.10 Solr implementation, but I have a 5.5 implementation at my disposal as well. Thus far, I have not gotten this to work on either one, and I have yet to find a complete documentation source on how to implement this.
The goal is to get a resulting document from Solr that looks like this:
{
"id": 123,
"firstName": "John",
"lastName": "Doe",
"languagesSpoken": [
{
"id": 243,
"abbreviation": "en",
"name": "English"
},
{
"id": 442,
"abbreviation": "fr",
"name": "French"
}
]
}
In my schema.xml, I have flattened out all of the fields as follows:
<field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
<field name="firstName" type="text_general" indexed="true" stored="true" />
<field name="lastName" type="text_general" indexed="true" stored="true" />
<field name="languagesSpoken" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="languagesSpoken_id" type="int" indexed="true" stored="true" />
<field name="languagesSpoken_abbreviation " type="text_general" indexed="true" stored="true" />
<field name="languagesSpoken_name" type="text_general" indexed="true" stored="true" />
The latest rendition of my db-data-config.xml looks like this:
<dataConfig>
<dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:...." />
<document name="clients">
<entity name="client" query="SELECT * FROM clients" deltaImportQuery="SELECT * FROM clients WHERE id = ${dih.delta.id}" deltaQuery="SELECT id FROM clients WHERE updateDate > '${dih.last_index_time}'">
<field column="id" name="id" />
<field column="firstName" name="firstName" />
<field column="lastName" name="lastName" />
<entity name="languagesSpoken" child="true" query="SELECT id, abbreviation, name FROM languages WHERE clientId = ${client.id}">
<field name="languagesSpoken_id" column="id" />
<field name="languagesSpoken_abbreviation" column="abbreviation" />
<field name="languagesSpoken_name" column="name" />
</entity>
</entity>
</document>
...
On the 4.10 server, when the data comes out of Solr, I get one flat document record with the fields for one language inline with the firstName and lastName, like this:
{
"id": 123,
"firstName": "John",
"lastName": "Doe",
"languagesSpoken_id": 243,
"languagesSpoken_abbreviation ": "en",
"languagesSpoken_name": "English"
}
On the 5.5 server, when the data comes out, I get separate documents for the root client document and the child language documents with no relationship between them like this:
{
"id": 123,
"firstName": "John",
"lastName": "Doe"
},
{
"languagesSpoken_id": 243,
"languagesSpoken_abbreviation": "en",
"languagesSpoken_name": "English"
},
{
"languagesSpoken_id": 442,
"languagesSpoken_abbreviation": "fr",
"languagesSpoken_name": "French"
}
I have spent several days now trying to figure out what is going on here to no avail. Can anybody provide me with a pointer as to what I am missing here?
Thanks,
-- Jeff
You may want to flatten your JSON objects before you import them into Solr, as described here:
https://stackoverflow.com/a/19101235/929902
POST http://localhost:8983/solr/ggg_core/update?boost=1.0&commitWithin=1000&overwrite=true&wt=json HTTP/1.1
Then, once you read the data back from Solr, you can unflatten it in a similar way.
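To make the flatten/unflatten round trip concrete, here is a minimal Python sketch. It is not the code from the linked answer; the function names, the "." separator, and the convention of using list positions as key segments are all assumptions chosen for illustration.

```python
def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts/lists into a single-level dict with compound keys."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = ((str(i), v) for i, v in enumerate(obj))  # list index becomes a key part
    else:
        return {parent_key: obj}  # scalar leaf
    flat = {}
    for key, value in items:
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        flat.update(flatten(value, full_key, sep))
    return flat


def _dicts_to_lists(node):
    """Turn dicts keyed "0".."n-1" back into lists, recursively."""
    if not isinstance(node, dict):
        return node
    converted = {k: _dicts_to_lists(v) for k, v in node.items()}
    if converted and set(converted) == {str(i) for i in range(len(converted))}:
        return [converted[str(i)] for i in range(len(converted))]
    return converted


def unflatten(flat, sep="."):
    """Rebuild the nested structure from compound keys produced by flatten()."""
    nested = {}
    for compound_key, value in flat.items():
        parts = compound_key.split(sep)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return _dicts_to_lists(nested)


client = {
    "id": 123,
    "firstName": "John",
    "lastName": "Doe",
    "languagesSpoken": [
        {"id": 243, "abbreviation": "en", "name": "English"},
        {"id": 442, "abbreviation": "fr", "name": "French"},
    ],
}
flat = flatten(client)        # keys such as "languagesSpoken.0.name"
restored = unflatten(flat)    # same structure as client
```

The sketch has known limits: empty nested lists or dicts are dropped, keys that already contain the separator are not handled, and a dict whose keys happen to be "0", "1", ... comes back as a list.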

Solr query returns Id only

I want to retrieve the name of each user, but the query returns only the id.
I am using Solr 5.5.0.
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://server:3306/dbname" user="user" password="pwd"/>
<document name="user">
<entity name="user" query="select id,name from user">
<field column="id" name="id"/>
<field column="name" name="name"/>
</entity>
</document>
</dataConfig>
<field type="int" indexed="true" stored="true" name="id" />
<field multiValued="true" name="name" type="text" indexed="true" stored="true" />
output
response:
{
"numFound": 38,
"start": 0,
"docs": [
{
"id": "1",
"_version_": 1527443171669180400
},
{
"id": "3",
"_version_": 1527443171672326100
},

Solr BlockJoin Indexing for Solr 4.10.1

I am trying to index a nested structure as below and am having difficulty indexing it with both SolrJ and the DIH. I have battled with this for a while and would really appreciate some help.
How do I fix this with either SolrJ or DIH?
Thanks
What I want my data to look like in my index:
"docs": [
{
"name": "MR INCREDIBLE ",
"id": 101,
"job": "super hero",
"_version_": "1483934897344086016",
"children": [
{
"c_name":"Violet",
"c_age":10,
"c_gender":"female"
},
{
"c_name":"Dash",
"c_age":8,
"c_gender":"male"
}
]
}
]
My schema.xml
<schema name="datasearch" version="1.5">
<uniqueKey>id</uniqueKey>
<fields>
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="id" type="string" indexed="true" stored="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="job" type="string" indexed="true" stored="true"/>
<!-- I want to add children here -->
<!-- <field name="children" indexed="true" stored="true"/> -->
<field name="c_name" type="string" indexed="true" stored="true"/>
<field name="c_age" type="int" indexed="true" stored="true"/>
<field name="c_sex" type="string" indexed="true" stored="true"/>
</fields>
<types>
<fieldType name="string" class="solr.TrieLongField" />
<fieldType name="int" class="solr.TrieIntField" />
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" />
<fieldType name="long" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
</types>
<defaultSearchField>name</defaultSearchField>
</schema>
SolrJ Attempt
val serverUrl = current.configuration.getString("solr.server.url").get
val solr = new HttpSolrServer(serverUrl)
def testAddChildDoc={
val doc = {
new SolrInputDocument(){
addField("id", "101")
addField("name", "Mr Incredible")
}
}
val c1 = new SolrInputDocument(){
addField("c_name", "violet")
addField("c_age", 10)
}
val c2 = new SolrInputDocument(){
addField("c_name", "dash")
addField("c_age", 8)
}
doc.addChildDocument(c1)
doc.addChildDocument(c2)
solr.deleteByQuery("*:*")
solr.add(doc)
solr.commit(true, true)
}
Response
=>ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: [doc=null] missing required field: id
[RemoteSolrException: [doc=null] missing required field: id]
So I go ahead and add an id to the child docs, changing the above to:
...
val c1 = new SolrInputDocument(){
addField("id", "101")
addField("c_name", "violet")
addField("c_age", 10)
}
val c2 = new SolrInputDocument(){
addField("id", "101")
addField("c_name", "dash")
addField("c_age", 8)
}
.....
Then I rerun the get-all query, and now I get the results below:
SolrJ Attempt 2 plus get-all query
{
"responseHeader": {
"status": 0,
"QTime": 0,
"params": {
"indent": "true",
"q": "*:*",
"_": "1415194092582",
"wt": "json"
}
},
"response": {
"numFound": 3,
"start": 0,
"docs": [
{
"id": 101,
"c_name": "violet",
"c_age": "10"
},
{
"id": 101,
"c_name": "dash",
"c_age": "8"
},
{
"id": 101,
"name": "Mr Incredible",
"_version_": "1483938552238571520"
}
]
}
}
So I give up here and try the DIH as below:
db-dataconfig.xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="org.postgresql.Driver"
url="jdbc:postgresql://xxx:5432/xxxx"
user="xx" password="xx"
readOnly="true" autoCommit="false" transactionIsolation="TRANSACTION_READ_COMMITTED" holdability="CLOSE_CURSORS_AT_COMMIT" />
<document>
<entity name="parent" query="select id,name, job from PARENTS LIMIT 1" >
<field column="name"/>
<field column="id"/>
<field column="job"/>
<entity child="true" name="children" query="select c_name, c_gender, c_age from CHILDREN" where="pid = ${parent.id}" processor="CachedSqlEntityProcessor">
<field column="c_age" />
<field column="c_gender" />
<field column="c_name"/>
</entity>
</entity>
</document>
</dataConfig>
The get-all query after a full import with the DIH config above returns only the parent, with no children indexed:
{
"responseHeader": {
"status": 0,
"QTime": 0,
"params": {
"indent": "true",
"q": "*:*",
"_": "1415195060664",
"wt": "json"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"name": "Mr Incredible",
"id": 101,
"_version_": "1483939357483073536"
}
]
}
}
To be able to use child="true" in DIH, apply the patch from https://issues.apache.org/jira/browse/SOLR-5147 (I think it's the same DIH patch as in SOLR-3076).
The patch itself seems to be incompatible with the current trunk, but only in negligible details.
In order to get the following response from Solr 4.10.1
{
"name": "MR INCREDIBLE ",
"id": 101,
"job": "super hero",
"type": "parent",
"_root_":"101",
"_version_": "1483934897344086016",
"childDocuments": [
{
"c_name":"Violet",
"c_age":10,
"c_gender":"female",
"id":"101_Violet",
"_root_":"101"
},
{
"c_name":"Dash",
"c_age":8,
"c_gender":"male",
"id":"101_Dash",
"_root_":"101"
}
]
}
a "type" field needs to be defined in the schema to differentiate between parent and child documents:
<fields>
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="id" type="string" indexed="true" stored="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="job" type="string" indexed="true" stored="true"/>
<field name="c_name" type="string" indexed="true" stored="true"/>
<field name="c_age" type="int" indexed="true" stored="true"/>
<field name="c_gender" type="string" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true" />
</fields>
Child documents also need to have a unique "id", just like any other document.
All the documents in the index should be in a parent/child relation, otherwise the queries may return unexpected results. In case you need documents which are neither parents nor children, assign them a fake parent.
SolrJ
To work with child/parent docs, solrj.jar version 4.5 or higher is required.
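import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
// serverUrl is the URL of the Solr core, as in the question.
SolrServer solr = new HttpSolrServer(serverUrl);
// Parent document, marked with type=parent so block-join queries can find it.
SolrInputDocument doc = new SolrInputDocument();
String id = "101";
doc.addField("id", id);
doc.addField("name", "Mr Incredible");
doc.addField("job", "super hero");
doc.addField("type", "parent");
// First child: its id is derived from the parent id so that it stays unique.
SolrInputDocument childDoc1 = new SolrInputDocument();
String name1 = "Violet";
childDoc1.addField("id", id + "_" + name1);
childDoc1.addField("c_name", name1);
childDoc1.addField("c_age", 10);
childDoc1.addField("c_gender", "female");
doc.addChildDocument(childDoc1);
// Second child, built the same way.
SolrInputDocument childDoc2 = new SolrInputDocument();
String name2 = "Dash";
childDoc2.addField("id", id + "_" + name2);
childDoc2.addField("c_name", name2);
childDoc2.addField("c_age", 8);
childDoc2.addField("c_gender", "male");
doc.addChildDocument(childDoc2);
// Parent and children are sent together as one block.
solr.add(doc);
solr.commit();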
Finally, the query looks like this:
http://localhost/solr/core/select?q={!parent which='type:parent'}&fl=*,[child parentFilter=type:parent]&wt=json&indent=true
To get only results of female gender:
http://localhost/solr/core/select?q={!parent which='type:parent'}c_gender:female&fl=*,[child parentFilter=type:parent childFilter=c_gender:female]&wt=json&indent=true
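The URLs above contain braces, brackets, quotes and spaces, which need to be percent-encoded when the request is sent from code rather than typed into a browser. The following Python sketch builds the same two parameters with the standard library; the helper name build_block_join_url and its arguments are assumptions for illustration and are not part of any Solr client.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_block_join_url(base_url, parent_filter, child_query="", child_filter=None):
    """Assemble a select URL with a {!parent} query and a [child] transformer."""
    transformer = f"[child parentFilter={parent_filter}"
    if child_filter:
        transformer += f" childFilter={child_filter}"
    transformer += "]"
    params = {
        # {!parent which='...'} followed by the query that matches child documents
        "q": f"{{!parent which='{parent_filter}'}}{child_query}",
        # return all parent fields plus the (optionally filtered) children
        "fl": f"*,{transformer}",
        "wt": "json",
        "indent": "true",
    }
    return f"{base_url}?{urlencode(params)}"


url = build_block_join_url(
    "http://localhost/solr/core/select",
    "type:parent",
    child_query="c_gender:female",
    child_filter="c_gender:female",
)
decoded = parse_qs(urlsplit(url).query)  # decode again to check the round trip
```

Decoding the query string gives back the q and fl values from the second URL above, so the encoded form carries the same request.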

Solr Data Import - array of strings

Hi, can anybody point me in the right direction for using Solr's Data Import Handler (DIH) to create an array of strings based on a SQL query?
My Solr DIH config looks like this:
<dataConfig>
<dataSource driver="org.postgresql.Driver"
url="jdbc:postgresql://localhost:5432/data"
user="xxxxx"
password="xxxxxx" />
<document>
<entity name="item" query="select id, subject from table1">
<field column="id" name="id" />
<field column="subject" name="subject" />
<entity name="ip_address" query="select ip_address from table2 where id='${item.id}'">
<field column="ip_address" name="ip_address" />
</entity>
</entity>
</document>
</dataConfig>
The query on table2 actually returns multiple items so I need this to be reflected in my documents.
e.g. :
{
"numFound": 1,
"start": 0,
"docs": [
{
"id": "29331109",
"subject": "Test document",
"ip_address": [
"88.103.210.139",
"88.103.210.144",
"88.103.210.133"
],
"_version_": 1468439879154139100
}
]
}
This is almost working for me except that Solr is only populating the first ip_address in my documents.
Here's the relevant part of my Schema:
<!-- Custom Field names -->
<field name="serial_number" type="string" indexed="true" stored="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="ip_address" type="string" indexed="true" stored="true" multiValued="true"/>
How is the "ip_address" field defined in schema.xml? It should be a multiValued field.

How to write nested schema.xml in solr?

How do I write a nested schema.xml in Solr?
The comment in the example schema.xml says
<!-- points to the root document of a block of nested documents. Required for nested
document support, may be removed otherwise
-->
<field name="_root_" type="string" indexed="true" stored="false"/>
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/collection1/conf/schema.xml?view=markup
Which can be used in
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
What will be schema.xml for nesting the following items:
Person string
Address
  city string
  postcode string
I know this is an old question, but I ran into a similar issue. Modifying my solution for yours, the fields you need to add to your schema.xml are as follows:
<field name="person" type="string" indexed="true" stored="true" />
<field name="address" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="address.city" type="string" indexed="true" stored="true" />
<field name="address.postcode" type="string" indexed="true" stored="true" />
Then when you run it you should be able to add the following JSON to your Solr instance and see the matching output in the query:
{
"person": "John Smith",
"address": {
"city": "San Diego",
"postcode": 92093
}
}

Resources