Importing Nested Documents in Solr using DataImportHandler

I am working on a project where the specification requires a parent-child relationship within the Solr data collection, i.e. a user and the collection of languages they speak (each of which is made up of multiple data fields). My production system is a Solr 4.10 implementation, but I have a 5.5 implementation at my disposal as well. Thus far, I have not gotten this to work on either one, and I have yet to find a complete documentation source on how to implement it.
The goal is to get a resulting document from Solr that looks like this:
{
"id": 123,
"firstName": "John",
"lastName": "Doe",
"languagesSpoken": [
{
"id": 243,
"abbreviation": "en",
"name": "English"
},
{
"id": 442,
"abbreviation": "fr",
"name": "French"
}
]
}
In my schema.xml, I have flattened out all of the fields as follows:
<field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
<field name="firstName" type="text_general" indexed="true" stored="true" />
<field name="lastName" type="text_general" indexed="true" stored="true" />
<field name="languagesSpoken" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="languagesSpoken_id" type="int" indexed="true" stored="true" />
<field name="languagesSpoken_abbreviation " type="text_general" indexed="true" stored="true" />
<field name="languagesSpoken_name" type="text_general" indexed="true" stored="true" />
The latest rendition of my db-data-config.xml looks like this:
<dataConfig>
<dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:...." />
<document name="clients">
<entity name="client" query="SELECT * FROM clients" deltaImportQuery="SELECT * FROM clients WHERE id = ${dih.delta.id}" deltaQuery="SELECT id FROM clients WHERE updateDate > '${dih.last_index_time}'">
<field column="id" name="id" />
<field column="firstName" name="firstName" />
<field column="lastName" name="lastName" />
<entity name="languagesSpoken" child="true" query="SELECT id, abbreviation, name FROM languages WHERE clientId = ${client.id}">
<field name="languagesSpoken_id" column="id" />
<field name="languagesSpoken_abbreviation" column="abbreviation" />
<field name="languagesSpoken_name" column="name" />
</entity>
</entity>
</document>
...
On the 4.10 server, when the data comes out of Solr, I get one flat document with the fields for one language inline with firstName and lastName, like this:
{
"id": 123,
"firstName": "John",
"lastName": "Doe",
"languagesSpoken_id": 243,
"languagesSpoken_abbreviation ": "en",
"languagesSpoken_name": "English"
}
On the 5.5 server, when the data comes out, I get separate documents for the root client document and the child language documents, with no relationship between them, like this:
{
"id": 123,
"firstName": "John",
"lastName": "Doe"
},
{
"languagesSpoken_id": 243,
"languagesSpoken_abbreviation": "en",
"languagesSpoken_name": "English"
},
{
"languagesSpoken_id": 442,
"languagesSpoken_abbreviation": "fr",
"languagesSpoken_name": "French"
}
I have spent several days now trying to figure out what is going on here to no avail. Can anybody provide me with a pointer as to what I am missing here?
Thanks,
-- Jeff

You may want to flatten your JSON objects, as in the answer linked below, before you import them into Solr:
https://stackoverflow.com/a/19101235/929902
POST http://localhost:8983/solr/ggg_core/update?boost=1.0&commitWithin=1000&overwrite=true&wt=json HTTP/1.1
Then, once you read the documents back from Solr, you can unflatten them in a similar way.
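The flatten/unflatten round trip can be sketched in Python as follows. This is only an illustration under assumptions, not the code from the linked answer: the dotted key convention (`languagesSpoken.0.id`) and the function names `flatten` and `unflatten` are made up for this example, and each flattened key would still need a matching field or dynamicField in the Solr schema. Keys that themselves contain dots, and empty lists or objects, are not handled.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into one level of dotted keys."""
    flat: Dict[str, Any] = {}
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions become key segments
    else:
        flat[prefix] = value
        return flat
    for key, child in items:
        child_prefix = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, child_prefix))
    return flat


def unflatten(flat: Dict[str, Any]) -> Any:
    """Rebuild the nested structure from dotted keys."""
    root: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        node = root
        parts = dotted_key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return _restore_lists(root)


def _restore_lists(node: Any) -> Any:
    """Turn dicts whose keys are all digits back into lists."""
    if not isinstance(node, dict):
        return node
    restored = {key: _restore_lists(child) for key, child in node.items()}
    if restored and all(key.isdigit() for key in restored):
        return [restored[key] for key in sorted(restored, key=int)]
    return restored


# The document from the question, used as sample input.
client = {
    "id": 123,
    "firstName": "John",
    "lastName": "Doe",
    "languagesSpoken": [
        {"id": 243, "abbreviation": "en", "name": "English"},
        {"id": 442, "abbreviation": "fr", "name": "French"},
    ],
}

flat_client = flatten(client)          # what would be sent to Solr
restored_client = unflatten(flat_client)  # what the reader rebuilds
```

Encoding the list position in the key keeps each language's id, abbreviation, and name together, which separate multivalued fields alone would not guarantee.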

Related

Solr query returns Id only

I want to retrieve the user's name, but the query returns only the id.
I am using Solr 5.5.0.
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://server:3306/dbname" user="user" password="pwd"/>
<document name="user">
<entity name="user" query="select id,name from user">
<field column="id" name="id"/>
<field column="name" name="name"/>
</entity>
</document>
</dataConfig>
<field type="int" indexed="true" stored="true" name="id" />
<field multiValued="true" name="name" type="text" indexed="true" stored="true" />
output
response:
{
"numFound": 38,
"start": 0,
"docs": [
{
"id": "1",
"_version_": 1527443171669180400
},
{
"id": "3",
"_version_": 1527443171672326100
},

Solr Data Import - array of strings

Hi, can anybody point me in the right direction for using Solr's Data Import Handler (DIH) to create an array of strings based on a SQL query?
My Solr DIH config looks like this:
<dataConfig>
<dataSource driver="org.postgresql.Driver"
url="jdbc:postgresql://localhost:5432/data"
user="xxxxx"
password="xxxxxx" />
<document>
<entity name="item" query="select id, subject from table1">
<field column="id" name="id" />
<field column="subject" name="subject" />
<entity name="ip_address" query="select ip_address from table2 where id='${item.id}'">
<field column="ip_address" name="ip_address" />
</entity>
</entity>
</document>
</dataConfig>
The query on table2 actually returns multiple items so I need this to be reflected in my documents.
e.g. :
{
"numFound": 1,
"start": 0,
"docs": [
{
"id": "29331109",
"subject": "Test document",
"ip_address": [
"88.103.210.139",
"88.103.210.144",
"88.103.210.133"
],
"_version_": 1468439879154139100
}
]
}
This is almost working for me except that Solr is only populating the first ip_address in my documents.
Here's the relevant part of my Schema:
<!-- Custom Field names -->
<field name="serial_number" type="string" indexed="true" stored="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="ip_address" type="string" indexed="true" stored="true" multiValued="true"/>
How is the "ip_address" field defined in schema.xml? It should be a multiValued field.

How to write nested schema.xml in solr?

How do I write a nested schema.xml in Solr?
The comment in the example schema.xml says:
<!-- points to the root document of a block of nested documents. Required for nested
document support, may be removed otherwise
-->
<field name="_root_" type="string" indexed="true" stored="false"/>
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/collection1/conf/schema.xml?view=markup
Which can be used in
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
What will be schema.xml for nesting the following items:
Person string
Address
    city string
    postcode string
I know this is an old question, but I ran into a similar issue. Modifying my solution for yours, the fields you need to add to your schema.xml are as follows:
<field name="person" type="string" indexed="true" stored="true" />
<field name="address" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="address.city" type="string" indexed="true" stored="true" />
<field name="address.postcode" type="string" indexed="true" stored="true" />
Then when you run it you should be able to add the following JSON to your Solr instance and see the matching output in the query:
{
"person": "John Smith",
"address": {
"city": "San Diego",
"postcode": 92093
}
}

Solr Indexing SQL record with duplicate uniqueKey

We need full-text search for a database with millions of records (music metadata). I have only been working with Solr for roughly two weeks, and I need some help with indexing. I am using DataImportHandler and have a SQL query that generates a result like this:
As you can see in the attached image above, the id (Integer data type) is repeated in the SQL result that DIH also uses, and when I set uniqueKey to <uniqueKey>id</uniqueKey>, Solr overwrites the values, leaving only one record/row. In fact, I think it is the last one processed, which is the one with countryCode 'TL'.
When I first had this issue, I knew why Solr was overwriting the value; that is normal. So I thought of adding a global identifier to each record in the db, a GUID. Without thinking things through properly, I ended up with the same duplicates: as you can see, charGuid, which is a uuid() from MySQL, is duplicated.
But when I use charGuid (String data type) as the uniqueKey, <uniqueKey>charGuid</uniqueKey>, I get all records indexed and nothing is overwritten, but of course duplicates are inevitable. The problem I foresee here is that when I have to do an incremental update, Solr will not be able to know exactly which document to update. In fact, a quick test from the admin console revealed that the last or first record it finds with that unique key is updated. This is not acceptable.
I stumbled upon an article referencing multiValued="true". I thought that making the fields that represent a JOIN column in my SQL multivalued would do the trick, but it doesn't. I was hoping a record with id:10 would be returned with a list of countryCode values, but no.
I am just puzzled as to how to circumvent this issue and why I did not find a similar problem posted by someone.
If I don't get a meaningful answer, I guess I will have to use charGuid as <uniqueKey>, which allows duplicates, and then use Solr document deduplication to handle updates of my index, but I want to believe there is a better way.
Update
Here are my data-config.xml and schema.xml definitions:
<entity name="albums" query="select * from Album">
<entity name="track" query="select t.id as id, t.title as trackTitle, t.removed as trackRemovedDate, t.productState from Track t where t.albumId='${albums.id}'"/>
<entity name="albumSalesAreaId" query="select asa.salesAreaId as albumSalesAreaId from AlbumSalesArea asa where asa.albumId='${albums.id}'"/>
<entity name="albumSalesArea" query="select sa.name as albumSalesArea from SalesArea sa where sa.id='${albumSalesAreaId.salesAreaId}'"/>
<entity name="salesAreaCountry" query="select sac.countryId as 'salesAreaCountry' from SalesAreaCountry sac where sac.salesAreaId ='${salesArea.id}'"/>
<entity name="countryId" query="select c.id as 'countryId' from Country c where c.id = '${salesAreaCountry.countryId}'"/>
<entity name="countryName" query="select c.name as 'countryName' from Country c where c.id = '${salesAreaCountry.countryId}'"/>
</entity>
schema.xml
<!--new multivalue fields -->
<field name="albumSalesArea" type="int" stored="true" indexed="true" multiValued="true"/>
<field name="albumSalesAreaId" type="int" indexed="true" stored="true" multiValued="true"/>
<field name="salesAreaCountry" type="int" stored="true" indexed="true" multiValued="true"/>
<field name="countryId" type="int" indexed="true" stored="true" multiValued="true"/>
<field name="countryName" type="text_general" indexed="true" stored="true" multiValued="true"/>
When I compare my Solr response with the SQL result, I see countryCode in the SQL result but Solr has none; it only returned
"albumSalesAreaId": [
1,
3
],
Not sure why country etc. are not showing up.
Update 2
data-config.xml
<document name="content">
<entity name="albums" query="select * from Album">
<entity name="tracks" query="select t.id, t.title, t.removed, t.productState from Track t where t.albumId='${albums.id}'">
<field column="id" name="id" />
<field column="title" name="trackTitle" />
<field column="removed" name="trackRemovedDate" />
<field column="productState" name="trackProductState" />
</entity>
<entity name="albumSalesAreaIds" query="select salesAreaId from AlbumSalesArea where albumId = '${albums.id}'">
<field column="salesAreaId" name="albumSalesAreaId"/>
</entity>
<entity name="albumSalesAreaNames" query="select name from SalesArea where id = '${albumSalesAreaIds.salesAreaId}'">
<field column="name" name="albumSalesArea"/>
</entity>
<entity name="salesAreaCountryIds" query="select countryId from SalesAreaCountry where salesAreaId ='${albumSalesAreaIds.salesAreaId}'">
<field column="countryId" name="countryId" />
</entity>
<entity name="salesAreaCountry" query="select name from Country where id ='${salesAreaCountryIds.countryId}'">
<field column="name" name="countryName" />
</entity>
<field column="title" name="albumTitle"/>
<field column="removed" name="albumRemovedDate"/>
<field column="productState" name="albumProductState" />
</entity>
</document>
schema.xml
<field name="catchall" type="text_general" stored="true" indexed="true" multiValued="true"/>
<field name="publisher" type="text_general" indexed="true" stored="true"/>
<field name="uuid" type="binary" indexed="false" stored="true"/>
<field name="trackRemovedDate" type="tdate" indexed="true" stored="true"/>
<field name="albumRemovedDate" type="tdate" indexed="true" stored="true"/>
<field name="trackProductState" type="int" indexed="true" stored="true"/>
<field name="albumProductState" type="int" indexed="true" stored="true"/>
<field name="countryCode" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="albumTitle" type="text_general" indexed="true" stored="true"/>
<field name="trackTitle" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="guid" type="text_general" indexed="true" stored="true"/>
<!--new multivalue fields -->
<field name="albumSalesAreaId" type="int" indexed="true" stored="true" multiValued="true"/>
<field name="salesAreaCountry" type="int" stored="true" indexed="true" multiValued="true"/>
<field name="countryId" type="int" indexed="true" stored="true" multiValued="true"/>
<field name="countryName" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="albumSalesArea" type="text_general" indexed="true" stored="true" multiValued="true"/>
sample solr response for id:5
{
"responseHeader": {
"status": 0,
"QTime": 1,
"params": {
"indent": "true",
"q": "id:5",
"_": "1383221233535",
"wt": "json"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": "5",
"catchall": [
"5",
"Test Album 5",
"2011-10-21 00:00:00.0",
"[B#261ca3cb",
"Test Track 1",
"Ya man 2",
"2011-10-17 16:21:29.0",
"1",
"1450412569164513280"
],
"albumTitle": "Test Album 5",
"albumRemovedDate": "2011-10-21T00:00:00Z",
"uuid": "6oT/MMl+RDaPyKpGK1KN0w==",
"trackTitle": [
"Test Track 1",
"Ya man 2"
],
"trackRemovedDate": "2011-10-17T16:21:29Z",
"albumSalesAreaId": [
1
],
"_version_": 1450412569164513300
}
]
}
}
SQL result for id:5
trackTitle and albumSalesAreaId seem to be correct, but I am not sure why the others have not been included. However, if I hard-code the albumSalesAreaNames entity with from SalesArea where id = 1, then the albumSalesArea field is added to the result. So it seems that from SalesArea where id = '${albumSalesAreaIds.salesAreaId}' is returning null, which was also confirmed by my 'IN' test earlier.
This really looks like a problem that is simply solved with multivalued fields.
If you use multivalued fields in this structure, what you will obtain is one document with ID=10; all the duplicated values will be there just once and all the other fields will be multivalued. For example, the NAME field will contain 4 different countries, and so will country_code.
Have a look at this article on how to structure your DataImportHandler to achieve this:
http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example
Basically, you need one query for each multivalued field:
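<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document name="products">
<entity name="item" query="select * from item">
<field column="ID" name="id" />
<field column="code" name="code" />
<!-- one sub-entity query per multivalued field -->
<entity name="countryName" query="select name from countrytable where item_id='${item.ID}'">
<!-- column is the SQL column; name is the Solr field (assumed to be a multiValued countryName field) -->
<field column="name" name="countryName" />
</entity>
<entity name="countryCode" query="select countryCode from countrytable where item_id='${item.ID}'">
<!-- no explicit mapping needed: DIH maps a column to the schema field of the same name -->
</entity>
</entity>
</document>
</dataConfig>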
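To make the target shape concrete, here is a small Python sketch of how repeated rows collapse into one document per id with multivalued fields. It only illustrates the result described above; it is not how DIH is implemented. The function name `rows_to_documents` and the sample rows are hypothetical and are not taken from the asker's data.

```python
from typing import Any, Dict, Iterable, List, Set


def rows_to_documents(
    rows: Iterable[Dict[str, Any]],
    unique_key: str,
    multi_valued: Set[str],
) -> List[Dict[str, Any]]:
    """Collapse joined SQL rows into one document per unique key."""
    documents: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        # Reuse the document for this key, or start a new one.
        doc = documents.setdefault(row[unique_key], {unique_key: row[unique_key]})
        for column, value in row.items():
            if column == unique_key or value is None:
                continue
            if column in multi_valued:
                values = doc.setdefault(column, [])
                if value not in values:  # keep each value once
                    values.append(value)
            else:
                # Single-valued columns repeat on every row; keep the first.
                doc.setdefault(column, value)
    return list(documents.values())


# Hypothetical joined result: id 10 appears once per country.
rows = [
    {"id": 10, "title": "Example Album", "countryCode": "TL", "countryName": "Timor-Leste"},
    {"id": 10, "title": "Example Album", "countryCode": "GB", "countryName": "United Kingdom"},
    {"id": 11, "title": "Another Album", "countryCode": "TL", "countryName": "Timor-Leste"},
]

documents = rows_to_documents(rows, "id", {"countryCode", "countryName"})
```

With id as the uniqueKey, each album ends up as a single document, so an incremental update has exactly one document to replace.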
(Posted on behalf of the OP).
SOLUTION
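The albumSalesAreaNames entity has to be nested inside the albumSalesAreaIds entity so that ${albumSalesAreaIds.salesAreaId} can be resolved:
<entity name="albumSalesAreaIds" query="select salesAreaId from AlbumSalesArea where albumId = '${albums.id}'">
<entity name="albumSalesAreaNames" query="select name from SalesArea where id = '${albumSalesAreaIds.salesAreaId}'">
<field column="name" name="albumSalesArea"/>
</entity>
<field column="salesAreaId" name="albumSalesAreaId"/>
</entity>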

Many-to-one mapping within Apache Solr

I am using Solr to index my database of reports. Reports can have text, submitter information, etc. This currently works and looks like this:
"docs": [
{
"Text": "Some Report Text",
"ReportId": "1",
"Date": "2013-08-09T14:59:28.147Z",
"SubmitterId": "11111",
"FirstName": "John",
"LastName": "Doe",
"_version_": 1444554112206110700
}
]
The other thing a report can have is viewers (which is a one-to-many relationship between a single report and the viewers.) I want to be able to capture those viewers like this in my JSON output:
"docs": [
{
"Text": "Some Report Text",
"ReportId": "1",
"Date": "2013-08-09T14:59:28.147Z",
"SubmitterId": "11111",
"FirstName": "John",
"LastName": "Doe",
"Viewers": [
{ ViewerId: "22222" },
{ ViewerId: "33333" }
],
"_version_": 1444554112206110700
}
]
I cannot seem to get that to happen, however. Here is my data-config.xml (parts removed that aren't necessary to the question):
<entity name="Report" query="select * from Reports">
<field column="Text" />
<field column="ReportId" />
<!-- Get Submitter Information as another entity. -->
<entity name="Viewers" query="select * from ReportViewers where Id='${Report.ReportId}'">
<field column="Id" name="ViewerId" />
</entity>
</entity>
And the schema.xml:
<field name="Text" type="text_en" indexed="true" stored="true" />
<field name="ReportId" type="string" indexed="true" stored="true" />
<field name="Viewers" type="string" indexed="true" stored="true" multiValued="true" />
<field name="ViewerId" type="string" indexed="true" stored="true" />
When I do the data import, I just don't see anything. No errors, nothing apparently wrong, but I'm pretty sure my data-config and/or my schema are not correct. What am I doing wrong?
Unfortunately, Solr does not allow nested objects as field values (see http://lucene.472066.n3.nabble.com/Possible-to-have-Solr-documents-with-deeply-nested-data-structures-i-e-hashes-within-hashes-td4004285.html). You need to flatten your data!
So
"Viewers": [
{ ViewerId: "22222" },
{ ViewerId: "33333" }
]
is not possible. Instead, flatten it and have a ViewerIds array:
"ViewerIds": ["22222", "33333" ]
In your schema, you will have:
<field name="ViewerIds" type="string" indexed="true" stored="true" multiValued="true" />
and modify your data-config accordingly.
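The shape change itself (not the data-config change) can be sketched in Python. This is a minimal illustration, assuming the application wants the nested Viewers form on its side and the flat ViewerIds form in Solr; the function names `to_solr_document` and `from_solr_document` are hypothetical.

```python
from typing import Any, Dict


def to_solr_document(report: Dict[str, Any]) -> Dict[str, Any]:
    """Replace the nested Viewers list with a flat ViewerIds array."""
    flat = {field: value for field, value in report.items() if field != "Viewers"}
    flat["ViewerIds"] = [viewer["ViewerId"] for viewer in report.get("Viewers", [])]
    return flat


def from_solr_document(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the nested Viewers list from the flat ViewerIds array."""
    nested = {field: value for field, value in doc.items() if field != "ViewerIds"}
    nested["Viewers"] = [{"ViewerId": viewer_id} for viewer_id in doc.get("ViewerIds", [])]
    return nested


# The desired document from the question, used as sample input.
report = {
    "Text": "Some Report Text",
    "ReportId": "1",
    "SubmitterId": "11111",
    "Viewers": [{"ViewerId": "22222"}, {"ViewerId": "33333"}],
}

solr_doc = to_solr_document(report)
round_tripped = from_solr_document(solr_doc)
```

Because each viewer carries only one value, a single multivalued field loses nothing, and the nested form can always be rebuilt by the client.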
