Solr BlockJoin Indexing for Solr 4.10.1 - solr

I am trying to index a nested structure like the one below, and I am having difficulty doing so with both SolrJ and the DIH. I have battled with this for a while and would really appreciate some help.
How do I fix this with either SolrJ or the DIH?
Thanks
This is what I want my data to look like in the index:
"docs": [
{
"name": "MR INCREDIBLE ",
"id": 101,
"job": "super hero",
"_version_": "1483934897344086016",
"children": [
{
"c_name":"Violet",
"c_age":10,
"c_gender":"female"
},
{
"c_name":"Dash",
"c_age":8,
"c_gender":"male"
}
}
]
}
]
My schema.xml
<schema name="datasearch" version="1.5">
<uniqueKey>id</uniqueKey>
<fields>
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="id" type="string" indexed="true" stored="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="job" type="string" indexed="true" stored="true"/>
<!-- I want to add children here -->
<!-- <field name="children" indexed="true" stored="true"/> -->
<field name="c_name" type="string" indexed="true" stored="true"/>
<field name="c_age" type="int" indexed="true" stored="true"/>
<field name="c_sex" type="string" indexed="true" stored="true"/>
</fields>
<types>
<fieldType name="string" class="solr.TrieLongField" />
<fieldType name="int" class="solr.TrieIntField" />
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" />
<fieldType name="long" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
</types>
<defaultSearchField>name</defaultSearchField>
</schema>
SolrJ Attempt
val serverUrl = current.configuration.getString("solr.server.url").get
val solr = new HttpSolrServer(serverUrl)
def testAddChildDoc={
val doc = {
new SolrInputDocument(){
addField("id", "101")
addField("name", "Mr Incredible")
}
}
val c1 = new SolrInputDocument(){
addField("c_name", "violet")
addField("c_age", 10)
}
val c2 = new SolrInputDocument(){
addField("c_name", "dash")
addField("c_age", 8)
}
doc.addChildDocument(c1)
doc.addChildDocument(c2)
solr.deleteByQuery("*:*")
solr.add(doc)
solr.commit(true, true)
}
Response
=>ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: [doc=null] missing required field: id
[RemoteSolrException: [doc=null] missing required field: id]
So I went ahead and added an id to the child documents, changing the above to:
...
val c1 = new SolrInputDocument(){
addField("id", "101")
addField("c_name", "violet")
addField("c_age", 10)
}
val c2 = new SolrInputDocument(){
addField("id", "101")
addField("c_name", "dash")
addField("c_age", 8)
}
.....
Then I reran the get-all query, and now I get the results below:
SolrJ Attempt 2 plus get-all query
{
"responseHeader": {
"status": 0,
"QTime": 0,
"params": {
"indent": "true",
"q": "*:*",
"_": "1415194092582",
"wt": "json"
}
},
"response": {
"numFound": 3,
"start": 0,
"docs": [
{
"id": 101,
"c_name": "violet",
"c_age": "10"
},
{
"id": 101,
"c_name": "dash",
"c_age": "8"
},
{
"id": 101,
"name": "Mr Incredible",
"_version_": "1483938552238571520"
}
]
}
}
So I gave up here and tried the DIH as below.
db-dataconfig.xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="org.postgresql.Driver"
url="jdbc:postgresql://xxx:5432/xxxx"
user="xx" password="xx"
readOnly="true" autoCommit="false" transactionIsolation="TRANSACTION_READ_COMMITTED" holdability="CLOSE_CURSORS_AT_COMMIT" />
<document>
<entity name="parent" query="select id,name, job from PARENTS LIMIT 1" >
<field column="name"/>
<field column="id"/>
<field column="job"/>
<entity child="true" name="children" query="select c_name, c_gender, c_age from CHILDREN" where="pid = ${parent.id}" processor="CachedSqlEntityProcessor">
<field column="c_age" />
<field column="c_gender" />
<field column="c_name"/>
</entity>
</entity>
</document>
</dataConfig>
Get-all query after a full import with the DIH config above; no children were indexed:
{
"responseHeader": {
"status": 0,
"QTime": 0,
"params": {
"indent": "true",
"q": "*:*",
"_": "1415195060664",
"wt": "json"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"name": "Mr Incredible",
"id": 101,
"_version_": "1483939357483073536"
}
]
}
}

To be able to use child="true" in the DIH, apply the patch from https://issues.apache.org/jira/browse/SOLR-5147 (I think it is the same DIH patch as in SOLR-3076).
The patch itself appears to be incompatible with the current trunk only in negligible details.

The goal is the following response from Solr 4.10.1:
{
"name": "MR INCREDIBLE ",
"id": 101,
"job": "super hero",
"type": "parent",
"_root_":"101",
"_version_": "1483934897344086016",
"childDocuments": [
{
"c_name":"Violet",
"c_age":10,
"c_gender":"female",
"id":"101_Violet",
"_root_":"101"
},
{
"c_name":"Dash",
"c_age":8,
"c_gender":"male",
"id":"101_Dash",
"_root_":"101"
}
]
}
For this, a "type" field needs to be defined in the schema to differentiate between parent and child documents:
<fields>
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="id" type="string" indexed="true" stored="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="job" type="string" indexed="true" stored="true"/>
<field name="c_name" type="string" indexed="true" stored="true"/>
<field name="c_age" type="int" indexed="true" stored="true"/>
<field name="c_gender" type="string" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true" />
</fields>
Child documents also need to have a unique "id", just like any other document.
All the documents in the index should be in a parent/child relation; otherwise the queries may return unexpected results. If you need documents that are neither parents nor children, assign them a fake parent.
SolrJ
To work with child/parent docs, solrj.jar version 4.5 or higher is required.
SolrServer solr = new HttpSolrServer(serverUrl);
SolrInputDocument doc = new SolrInputDocument();
String id = "101";
doc.addField("id", id);
doc.addField("name", "Mr Incredible");
doc.addField("job", "super hero");
doc.addField("type", "parent");
SolrInputDocument childDoc1 = new SolrInputDocument();
String name1 = "Violet";
childDoc1.addField("id", id + "_" + name1);
childDoc1.addField("c_name", name1);
childDoc1.addField("c_age", 10);
childDoc1.addField("c_gender", "female");
doc.addChildDocument(childDoc1);
SolrInputDocument childDoc2 = new SolrInputDocument();
String name2 = "Dash";
childDoc2.addField("id", id + "_" + name2);
childDoc2.addField("c_name", name2);
childDoc2.addField("c_age", 8);
childDoc2.addField("c_gender", "male");
doc.addChildDocument(childDoc2);
solr.add(doc);
solr.commit();
Finally, the query looks like this:
http://localhost/solr/core/select?q={!parent which='type:parent'}&fl=*,[child parentFilter=type:parent]&wt=json&indent=true
To get only results of female gender:
http://localhost/solr/core/select?q={!parent which='type:parent'}c_gender:female&fl=*,[child parentFilter=type:parent childFilter=c_gender:female]&wt=json&indent=true
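When these queries are sent from code rather than pasted into a browser, the local-params braces, quotes, spaces, and brackets generally need URL encoding. The following is a minimal sketch in plain Java, assuming the type:parent filter used above. The class and method names (BlockJoinUrls, enc, selectUrl) and the base URL are illustrative only and are not part of Solr or SolrJ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not a Solr or SolrJ class): builds the two
// block-join select URLs shown above with URL-encoded parameter values.
final class BlockJoinUrls {

    // Percent-encodes one query-parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // parentFilter identifies parent documents (e.g. "type:parent");
    // childQuery restricts the children (e.g. "c_gender:female"), or "" for none.
    static String selectUrl(String base, String parentFilter, String childQuery) {
        String q = "{!parent which='" + parentFilter + "'}" + childQuery;
        String transformer = "[child parentFilter=" + parentFilter
                + (childQuery.isEmpty() ? "" : " childFilter=" + childQuery)
                + "]";
        return base + "/select?q=" + enc(q)
                + "&fl=" + enc("*," + transformer)
                + "&wt=json&indent=true";
    }

    public static void main(String[] args) {
        String base = "http://localhost/solr/core"; // assumed core URL
        System.out.println(selectUrl(base, "type:parent", ""));
        System.out.println(selectUrl(base, "type:parent", "c_gender:female"));
    }
}
```

Keeping the parent filter in one argument makes it harder for the which= clause and the parentFilter= of the child transformer to drift apart, since both must describe the same set of parent documents.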

Related

Importing Nested Documents in Solr using DataImportHandler

I am working on a project where the specification requires a parent-child relationship within the Solr data collection, i.e. a user and the collection of languages they speak (each of which is made up of multiple data fields). My production system is a Solr 4.10 implementation, but I have a 5.5 implementation at my disposal as well. Thus far, I have not gotten this to work on either one, and I have yet to find a complete documentation source on how to implement this.
The goal is to get a resulting document from Solr that looks like this:
{
"id": 123,
"firstName": "John",
"lastName": "Doe",
"languagesSpoken": [
{
"id": 243,
"abbreviation": "en",
"name": "English"
},
{
"id": 442,
"abbreviation": "fr",
"name": "French"
}
]
}
In my schema.xml, I have flattened out all of the fields as follows:
<field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
<field name="firstName" type="text_general" indexed="true" stored="true" />
<field name="lastName" type="text_general" indexed="true" stored="true" />
<field name="languagesSpoken" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="languagesSpoken_id" type="int" indexed="true" stored="true" />
<field name="languagesSpoken_abbreviation " type="text_general" indexed="true" stored="true" />
<field name="languagesSpoken_name" type="text_general" indexed="true" stored="true" />
The latest rendition of my db-data-config.xml looks like this:
<dataConfig>
<dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:...." />
<document name="clients">
<entity name="client" query="SELECT * FROM clients" deltaImportQuery="SELECT * FROM clients WHERE id = ${dih.delta.id}" deltaQuery="SELECT id FROM clients WHERE updateDate > '${dih.last_index_time}'">
<field column="id" name="id" />
<field column="firstName" name="firstName" />
<field column="lastName" name="lastName" />
<entity name="languagesSpoken" child="true" query="SELECT id, abbreviation, name FROM languages WHERE clientId = ${client.id}">
<field name="languagesSpoken_id" column="id" />
<field name="languagesSpoken_abbreviation" column="abbreviation" />
<field name="languagesSpoken_name" column="name" />
</entity>
</entity>
</document>
...
On the 4.10 server, when the data comes out of Solr, I get one flat document record with the fields for one language in line with the firstName and lastName, like this:
{
"id": 123,
"firstName": "John",
"lastName": "Doe",
"languagesSpoken_id": 243,
"languagesSpoken_abbreviation ": "en",
"languagesSpoken_name": "English"
}
On the 5.5 server, when the data comes out, I get separate documents for the root client document and the child language documents with no relationship between them like this:
{
"id": 123,
"firstName": "John",
"lastName": "Doe"
},
{
"languagesSpoken_id": 243,
"languagesSpoken_abbreviation": "en",
"languagesSpoken_name": "English"
},
{
"languagesSpoken_id": 442,
"languagesSpoken_abbreviation": "fr",
"languagesSpoken_name": "French"
}
I have spent several days now trying to figure out what is going on here to no avail. Can anybody provide me with a pointer as to what I am missing here?
Thanks,
-- Jeff
You may want to flatten your JSON objects, as described in the answer below, before you import them into Solr:
https://stackoverflow.com/a/19101235/929902
POST http://localhost:8983/solr/ggg_core/update?boost=1.0&commitWithin=1000&overwrite=true&wt=json HTTP/1.1
Then, once you read the documents back from Solr, you can unflatten them in a similar way.
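A minimal sketch of the flattening idea in plain Java, assuming dotted keys with list indices (for example languagesSpoken.0.name). The JsonFlattener class and its key scheme are illustrative and are not taken from the linked answer or from Solr; the schema would still need fields (for example dynamic fields) that match whatever key names are chosen.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns nested maps and lists into one flat map
// whose keys encode the path, e.g. "languagesSpoken.0.name".
final class JsonFlattener {

    static Map<String, Object> flatten(Map<String, Object> nested) {
        Map<String, Object> flat = new LinkedHashMap<>();
        walk("", nested, flat);
        return flat;
    }

    // Appends one path segment, avoiding a leading dot at the top level.
    private static String join(String prefix, Object key) {
        return prefix.isEmpty() ? String.valueOf(key) : prefix + "." + key;
    }

    private static void walk(String prefix, Object value, Map<String, Object> out) {
        if (value instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                walk(join(prefix, e.getKey()), e.getValue(), out);
            }
        } else if (value instanceof List) {
            List<?> items = (List<?>) value;
            for (int i = 0; i < items.size(); i++) {
                walk(join(prefix, i), items.get(i), out);
            }
        } else {
            out.put(prefix, value); // leaf value: keep it under its full path
        }
    }

    public static void main(String[] args) {
        Map<String, Object> english = new LinkedHashMap<>();
        english.put("id", 243);
        english.put("abbreviation", "en");
        english.put("name", "English");

        Map<String, Object> client = new LinkedHashMap<>();
        client.put("id", 123);
        client.put("firstName", "John");
        client.put("languagesSpoken", Arrays.asList(english));

        for (Map.Entry<String, Object> e : flatten(client).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Unflattening on the read side would split each key on the dots and rebuild the maps and lists; it is omitted here to keep the sketch short.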

Solr query returns only the id

I want to retrieve the user's name, but the query returns only the id.
I am using Solr 5.5.0.
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://server:3306/dbname" user="user" password="pwd"/>
<document name="user">
<entity name="user" query="select id,name from user">
<field column="id" name="id"/>
<field column="name" name="name"/>
</entity>
</document>
</dataConfig>
<field type="int" indexed="true" stored="true" name="id" />
<field multiValued="true" name="name" type="text" indexed="true" stored="true" />
output
response:
{
"numFound": 38,
"start": 0,
"docs": [
{
"id": "1",
"_version_": 1527443171669180400
},
{
"id": "3",
"_version_": 1527443171672326100
},

Solr Data Import - array of strings

Hi, can anybody point me in the right direction for using Solr's Data Import Handler (DIH) to create an array of strings based on an SQL query?
My Solr DIH config looks like this:
<dataConfig>
<dataSource driver="org.postgresql.Driver"
url="jdbc:postgresql://localhost:5432/data"
user="xxxxx"
password="xxxxxx" />
<document>
<entity name="item" query="select id, subject from table1">
<field column="id" name="id" />
<field column="subject" name="subject" />
<entity name="ip_address" query="select ip_address from table2 where id='${item.id}'">
<field column="ip_address" name="ip_address" />
</entity>
</entity>
</document>
</dataConfig>
The query on table2 actually returns multiple items so I need this to be reflected in my documents.
e.g.:
{
"numFound": 1,
"start": 0,
"docs": [
{
"id": "29331109",
"subject": "Test document",
"ip_address": [
"88.103.210.139",
"88.103.210.144",
"88.103.210.133"
],
"_version_": 1468439879154139100
}
]
}
This is almost working for me except that Solr is only populating the first ip_address in my documents.
Here's the relevant part of my Schema:
<!-- Custom Field names -->
<field name="serial_number" type="string" indexed="true" stored="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="ip_address" type="string" indexed="true" stored="true" multiValued="true"/>
How is the "ip_address" field defined in schema.xml? It should be a multiValued field.

Many-to-one mapping within Apache Solr

I am using Solr to index my database of reports. Reports can have text, submitter information, etc. This currently works and looks like this:
"docs": [
{
"Text": "Some Report Text",
"ReportId": "1",
"Date": "2013-08-09T14:59:28.147Z",
"SubmitterId": "11111",
"FirstName": "John",
"LastName": "Doe",
"_version_": 1444554112206110700
}
]
The other thing a report can have is viewers (which is a one-to-many relationship between a single report and the viewers.) I want to be able to capture those viewers like this in my JSON output:
"docs": [
{
"Text": "Some Report Text",
"ReportId": "1",
"Date": "2013-08-09T14:59:28.147Z",
"SubmitterId": "11111",
"FirstName": "John",
"LastName": "Doe",
"Viewers": [
{ ViewerId: "22222" },
{ ViewerId: "33333" }
],
"_version_": 1444554112206110700
}
]
I cannot seem to get that to happen, however. Here is my data-config.xml (parts removed that aren't necessary to the question):
<entity name="Report" query="select * from Reports">
<field column="Text" />
<field column="ReportId" />
<!-- Get Submitter Information as another entity. -->
<entity name="Viewers" query="select * from ReportViewers where Id='${Report.ReportId}'">
<field column="Id" name="ViewerId" />
</entity>
</entity>
And the schema.xml:
<field name="Text" type="text_en" indexed="true" stored="true" />
<field name="ReportId" type="string" indexed="true" stored="true" />
<field name="Viewers" type="string" indexed="true" stored="true" multiValued="true" />
<field name="ViewerId" type="string" indexed="true" stored="true" />
When I do the data import, I just don't see anything. No errors, nothing apparently wrong, but I'm pretty sure my data-config and/or my schema are not correct. What am I doing wrong?
Unfortunately, a Solr field cannot hold nested objects such as hashes within hashes (see http://lucene.472066.n3.nabble.com/Possible-to-have-Solr-documents-with-deeply-nested-data-structures-i-e-hashes-within-hashes-td4004285.html). You need to flatten your data!
So
"Viewers": [
{ ViewerId: "22222" },
{ ViewerId: "33333" }
]
is not possible. Instead flatten it and have a ViewerIds array:
"ViewerIds": ["22222", "33333" ]
In your schema, you will have:
<field name="ViewerIds" type="string" indexed="true" stored="true" multiValued="true" />
and modify your data-config accordingly.
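The flattening can be pictured as grouping the joined rows by report, so that each report ends up with a plain list of viewer ids. A minimal sketch in plain Java, assuming the rows arrive as {reportId, viewerId} pairs; the ViewerGrouping class and its row shape are hypothetical and only stand in for whatever the import step actually does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups {reportId, viewerId} rows into
// reportId -> list of viewer ids, the shape a multiValued
// "ViewerIds" field would hold for each report document.
final class ViewerGrouping {

    static Map<String, List<String>> groupViewerIds(List<String[]> rows) {
        Map<String, List<String>> byReport = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<String> ids = byReport.get(row[0]);
            if (ids == null) {
                ids = new ArrayList<>();
                byReport.put(row[0], ids);
            }
            ids.add(row[1]); // one value per viewer, in row order
        }
        return byReport;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "22222"});
        rows.add(new String[] {"1", "33333"});
        System.out.println(groupViewerIds(rows));
    }
}
```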

SOLR 4.0 alphabetical sorting trouble

I'm having a hard time getting my head around an issue I have with my SOLR address database.
I built this one up from the example files. I'm basically running the example configuration with a modified schema.
schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true" required="false" multiValued="false" />
<field name="givenname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="middleinitial_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="surname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="gender_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="pictureuri_s" type="string" indexed="false" stored="true" required="false" multiValued="false" />
<field name="function_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunit_s" type="text_general" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunitdescription_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="company_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="street_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="streetnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="postcode_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="city_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="building_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="roomnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="country_s" type="text_en" indexed="true" stored="true" required="true" multiValued="false" />
<field name="countrycode_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="emailaddress_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone1_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone2_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="mobile_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="fax_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
I am populating the database by pushing about 20,000 random test datasets like the following to post.jar:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<add>
<doc>
<field name="id">1352498443_1</field>
<field name="givenname_s">Aynur</field>
<field name="middleinitial_s"/>
<field name="surname_s">Lehnen</field>
<field name="gender_s">F</field>
<field name="pictureuri_s">dummy_assets/female.jpg</field>
<field name="function_s">Zugschaffner/in</field>
<field name="organizationalunit_s">P 07</field>
<field name="organizationalunitdescription_s">Lorem Ipsum sadipscing voluptua ipsum invidunt dolor et dolore invidunt sed consetetur accusam dolore Lorem tempor.</field>
<field name="company_s">Lorem Lagna Epsum Emet</field>
<field name="street_s">Erlenweg</field>
<field name="streetnumber_s">82</field>
<field name="postcode_s">76297</field>
<field name="city_s">Lübeck</field>
<field name="building_s"/>
<field name="roomnumber_s">242</field>
<field name="country_s">GERMANY</field>
<field name="countrycode_s">DE</field>
<field name="emailaddress_s">aynur.lehnen@lorem-lagna-epsum-emet.de</field>
<field name="phone1_s">0392984823</field>
<field name="phone2_s">0124111417</field>
<field name="mobile_s">0325117132</field>
<field name="fax_s">0171459177</field>
</doc>
</add>
However, when retrieving data I seem to have problems with alphabetical sorting. Consider the following query:
{
"responseHeader": {
"status": 0,
"QTime": 5,
"params": {
"sort": "surname_s asc",
"fl": "surname_s",
"indent": "true",
"wt": "json",
"q": "city_s:berlin"
}
},
"response": {
"numFound": 1094,
"start": 0,
"docs": [{
"surname_s": "Weil"
}, {
"surname_s": "Abel"
}, {
"surname_s": "Adam"
}, {
"surname_s": "Ade"
}, {
"surname_s": "Adrian"
}, {
"surname_s": "Aigner"
}, {
"surname_s": "Aigner"
}, {
"surname_s": "Alber"
}, {
"surname_s": "Alber"
}, {
"surname_s": "Albers"
}]
}
}
Why is "Weil" on position one, while the rest of the data appears to be sorted correctly?
I believe that some of the additional analyzers applied in the text_de field type are the cause of this sorting behavior. In my experience, the best results when sorting strings come from using the alphaOnlySort fieldType that comes with the example schema.xml, shown below.
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<!-- KeywordTokenizer does no actual tokenizing, so the entire
input string is preserved as a single token
-->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- The LowerCase TokenFilter does what you expect, which can be
when you want your sorting to be case insensitive
-->
<filter class="solr.LowerCaseFilterFactory" />
<!-- The TrimFilter removes any leading or trailing whitespace -->
<filter class="solr.TrimFilterFactory" />
<!-- The PatternReplaceFilter gives you the flexibility to use
Java Regular expression to replace any sequence of characters
matching a pattern with an arbitrary replacement string,
which may include back references to portions of the original
string matched by the pattern.
See the Java Regular Expression documentation for more
information on pattern and replacement string syntax.
http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html
-->
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])" replacement="" replace="all"
/>
</analyzer>
</fieldType>
I would recommend creating a new field and then copying the value from surname_s via copyField, something like the following:
<field name="surname_s_sort" type="alphaOnlySort" indexed="true" stored="false" required="false" multiValued="false" />
<copyField source="surname_s" dest="surname_s_sort"/>
Note: there is not any need to store the value in the surname_s_sort field, hence the stored="false" attribute, unless you expect to display that to the users.
Then you can just change your query to sort on the surname_s_sort instead.
Sorting doesn't work well on multivalued and tokenized fields.
Documentation -
Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)
Use string as the field type and copy the surname_s field into the new field.
<field name="surname_s_sort" type="string" indexed="true" stored="false"/>
<copyField source="surname_s" dest="surname_s_sort" />
As @Paige answered, you can also use a keyword tokenizer with a lowercase filter, neither of which splits the field into multiple tokens.
I had similar issues and tried alphaOnlySort. It works in part, but it starts messing up the sort results when the field contains characters such as -, /, or spaces.
So the result was something like
/ abc
aa
/ abc2
So I ended up using the field type lowercase. It was already there, so I figured that it is a default type. I used the copyField construction, so my final config was:
<schema>
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fields>
<field name="job_name_sort" type="lowercase" indexed="true" stored="false" required="false"/>
</fields>
<copyField source="job_name" dest="job_name_sort"/>
</schema>
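To see why the two field types can order such values differently, the sort keys can be approximated in plain Java. This is a rough sketch, assuming the filter chains quoted above (lowercase, trim, strip everything outside a-z for alphaOnlySort; lowercase only for the lowercase type). The SortKeys class is hypothetical and only imitates the analysis; it is not how Solr computes or compares terms internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical approximation of the sort keys produced by the two field types.
final class SortKeys {

    // alphaOnlySort: lowercase, trim, then remove every character outside a-z.
    static String alphaOnlyKey(String raw) {
        return raw.toLowerCase(Locale.ROOT).trim().replaceAll("[^a-z]", "");
    }

    // lowercase type: the whole value kept as a single lowercased token.
    static String lowercaseKey(String raw) {
        return raw.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>(Arrays.asList("/ abc", "aa", "/ abc2"));

        // Under alphaOnlySort, "/ abc" and "/ abc2" collapse to the same key,
        // so their relative order is no longer determined by the field value.
        for (String v : values) {
            System.out.println("[" + v + "] -> [" + alphaOnlyKey(v) + "]");
        }

        // Under the lowercase type, punctuation and digits stay part of the key.
        values.sort(Comparator.comparing(SortKeys::lowercaseKey));
        System.out.println(values);
    }
}
```

The trade-off is that the lowercase type sorts values starting with punctuation ahead of letters, while alphaOnlySort ignores punctuation and digits entirely; which one is preferable depends on the data.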
