Solr does not import fields other than id

I am using the Solr DataImportHandler module. Here is my config:
<dataConfig>
<dataSource type="JdbcDataSource"
name="sql"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://localhost;databaseName=AdventureWorks2008;integratedSecurity=true;"/>
<document>
<entity name="Person" dataSource="sql"
pk="BusinessEntityID"
query="select BusinessEntityID,FirstName,LastName FROM [Person].[Person]"
deltaImportQuery="select BusinessEntityID,FirstName,LastName FROM [Person].[Person] WHERE id='${dih.delta.id}'"
deltaQuery="SELECT BusinessEntityID FROM [Person].[Person] WHERE ModifiedDate > '${dih.last_index_time}'">
<field column="BusinessEntityID" name="id"/>
<field column="FirstName" name="firstname"/>
<field column="LastName" name="lastname"/>
</entity>
</document>
</dataConfig>
For some reason, only the id field is imported; the other fields are not.
What could be the reason? Am I missing something?

You may be missing the following entries in the schema.xml file:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="firstname" type="string" indexed="true" stored="true"/>
<field name="lastname" type="string" indexed="true" stored="true"/>
The type of id can also be int, depending on what you need:
<field name="id" type="int" indexed="true" stored="true" required="true"/>

Make sure the pk attribute and the schema's unique key field are set up properly.
I was facing the same issue; after I changed the pk and the unique field name, it worked fine.
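As a minimal sketch of what the two answers above describe, assuming a classic schema.xml and the field names from the question's data-config.xml, the schema side could look like this. The string type is an assumption; the point is that uniqueKey names the Solr field that the BusinessEntityID column is mapped to (id here), not the database column.

```xml
<!-- schema.xml (fragment): each DIH target field is declared here
     (or has to match a dynamicField pattern) -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="firstname" type="string" indexed="true" stored="true"/>
<field name="lastname" type="string" indexed="true" stored="true"/>

<!-- The uniqueKey refers to the Solr field name, not the SQL column name -->
<uniqueKey>id</uniqueKey>
```

Field names are case-sensitive, so firstname in the schema has to match the name attribute used in the DIH field mapping exactly.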

Related

How to index related tables in solr

I am trying to index my database for a question-and-answer website. To start off, I want to index the questions and answers tables, which have a one-to-many relationship. I would expect Solr to return documents like:
{
'question_id': 1,
'question': 'Is this a question?',
'answers' : [
{
'answer_id': 1,
'answer': 'Maybe'
},
{
'answer_id': 2,
'answer': 'yes it is'
}
]
}
What configuration do I need to achieve this?
I've gone through the "Configuring the DIH Configuration File" tutorial.
Below are the configurations I've tried:
CONFIG 1
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/questionsdb" user="root" password=""/>
<document>
<entity name="questions"
pk="id"
query="SELECT id, title FROM questions">
<field column="id" name="question_id"/>
<field column="title" name="title"/>
<entity name="answers"
pk="id"
query="select id, answer from answers where qid='${questions.id}'">
<field name="answer_id" column="id" />
<field name="answer" column="answer" />
</entity>
</entity>
</document>
</dataConfig>
QUERY OUTPUT:
CONFIG 2
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/questionsdb" user="root" password=""/>
<document>
<entity name="questions"
query="SELECT questions.id as question_id, questions.title as question, answers.id as answer_id, answers.answer as answer FROM questions JOIN answers ON questions.id = answers.qid">
<field column="id" name="question_id"/>
<field column="title" name="title"/>
<field name="answer" column="answer" />
<field name="answer_id" column="answer_id" />
</entity>
</document>
</dataConfig>
QUERY OUTPUT:
I'm using Solr 8.6.
EDIT 1:
Updated my managed-schema file to use multiValued="true":
<field name="question" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="question_id" type="pint" indexed="false" stored="true" multiValued="false"/>
<field name="answer" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="answer_id" type="pint" indexed="false" stored="true" multiValued="true"/>
The answers are indexed now, but answer and answer_id each come back as a separate flat list.
Is it possible to restructure them to be returned as a list of dictionaries as given in the example structure above?

Solr Indexing SQL record with duplicate uniqueKey

We need full-text search for a database with millions of records (music metadata). I've only been working with Solr for roughly two weeks and need some help with indexing. I am using DataImportHandler and have a SQL query that generates a result like this:
As you can see in the attached image above, the id (Integer data type) is repeated in the SQL result that DIH consumes. When I set <uniqueKey>id</uniqueKey>, Solr overwrites the values and leaves only one record/row; I think it is the last one processed, which is the one with countryCode 'TL'.
When I first hit this issue, I understood why Solr was overwriting the value; that is normal. So I thought of adding a global identifier, a GUID, to each record in the database. Without thinking it through properly, I ended up with the same duplicates: as you can see, charGuid, which is a uuid() from MySQL, is duplicated too.
But when I use charGuid (String data type) as the key, <uniqueKey>charGuid</uniqueKey>, all records are indexed and nothing is overwritten, but of course duplicates are inevitable. The problem I foresee is that when I have to do an incremental update, Solr will not know exactly which document to update. In fact, a quick test from the admin console revealed that the last or first record it finds with that unique key is updated. This is not acceptable.
I stumbled upon an article referencing multiValued="true" and thought that making the fields that represent a JOIN column in my SQL multivalued would do the trick, but it doesn't. I was hoping a record with id:10 would be returned with a list of countryCode values, but no.
I am puzzled as to how to circumvent this issue, and why I could not find a similar problem posted by someone else.
If I don't get a meaningful answer, I guess I will have to use charGuid as <uniqueKey>, which allows duplicates, and then use Solr's document deduplication to handle updates of my index, but I want to believe there is a better way.
Update
Here are my data-config.xml and schema.xml definitions:
<entity name="albums" query="select * from Album">
<entity name="track" query="select t.id as id, t.title as trackTitle, t.removed as trackRemovedDate, t.productState from Track t where t.albumId='${albums.id}'"/>
<entity name="albumSalesAreaId" query="select asa.salesAreaId as albumSalesAreaId from AlbumSalesArea asa where asa.albumId='${albums.id}'"/>
<entity name="albumSalesArea" query="select sa.name as albumSalesArea from SalesArea sa where sa.id='${albumSalesAreaId.salesAreaId}'"/>
<entity name="salesAreaCountry" query="select sac.countryId as 'salesAreaCountry' from SalesAreaCountry sac where sac.salesAreaId ='${salesArea.id}'"/>
<entity name="countryId" query="select c.id as 'countryId' from Country c where c.id = '${salesAreaCountry.countryId}'"/>
<entity name="countryName" query="select c.name as 'countryName' from Country c where c.id = '${salesAreaCountry.countryId}'"/>
</entity>
**Schema.xml**
<!--new multivalue fields -->
<field name="albumSalesArea" type="int" stored="true" indexed="true" multiValued="true"/>
<field name="albumSalesAreaId" type="int" indexed="true" stored="true" multiValued="true"/>
<field name="salesAreaCountry" type="int" stored="true" indexed="true" multiValued="true"/>
<field name="countryId" type="int" indexed="true" stored="true" multiValued="true"/>
<field name="countryName" type="text_general" indexed="true" stored="true" multiValued="true"/>
When I compare my Solr response with the SQL result, the SQL result has countryCode but the Solr response has none; Solr only returned:
"albumSalesAreaId": [
1,
3
],
I'm not sure why the country fields etc. are not showing up.
Update 2
data-config.xml
<document name="content">
<entity name="albums" query="select * from Album">
<entity name="tracks" query="select t.id, t.title, t.removed, t.productState from Track t where t.albumId='${albums.id}'">
<field column="id" name="id" />
<field column="title" name="trackTitle" />
<field column="removed" name="trackRemovedDate" />
<field column="productState" name="trackProductState" />
</entity>
<entity name="albumSalesAreaIds" query="select salesAreaId from AlbumSalesArea where albumId = '${albums.id}'">
<field column="salesAreaId" name="albumSalesAreaId"/>
</entity>
<entity name="albumSalesAreaNames" query="select name from SalesArea where id = '${albumSalesAreaIds.salesAreaId}'">
<field column="name" name="albumSalesArea"/>
</entity>
<entity name="salesAreaCountryIds" query="select countryId from SalesAreaCountry where salesAreaId ='${albumSalesAreaIds.salesAreaId}'">
<field column="countryId" name="countryId" />
</entity>
<entity name="salesAreaCountry" query="select name from Country where id ='${salesAreaCountryIds.countryId}'">
<field column="name" name="countryName" />
</entity>
<field column="title" name="albumTitle"/>
<field column="removed" name="albumRemovedDate"/>
<field column="productState" name="albumProductState" />
</entity>
</document>
schema.xml
<field name="catchall" type="text_general" stored="true" indexed="true" multiValued="true"/>
<field name="publisher" type="text_general" indexed="true" stored="true"/>
<field name="uuid" type="binary" indexed="false" stored="true"/>
<field name="trackRemovedDate" type="tdate" indexed="true" stored="true"/>
<field name="albumRemovedDate" type="tdate" indexed="true" stored="true"/>
<field name="trackProductState" type="int" indexed="true" stored="true"/>
<field name="albumProductState" type="int" indexed="true" stored="true"/>
<field name="countryCode" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="albumTitle" type="text_general" indexed="true" stored="true"/>
<field name="trackTitle" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="guid" type="text_general" indexed="true" stored="true"/>
<!--new multivalue fields -->
<field name="albumSalesAreaId" type="int" indexed="true" stored="true" multiValued="true"/>
<field name="salesAreaCountry" type="int" stored="true" indexed="true" multiValued="true"/>
<field name="countryId" type="int" indexed="true" stored="true" multiValued="true"/>
<field name="countryName" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="albumSalesArea" type="text_general" indexed="true" stored="true" multiValued="true"/>
sample solr response for id:5
{
"responseHeader": {
"status": 0,
"QTime": 1,
"params": {
"indent": "true",
"q": "id:5",
"_": "1383221233535",
"wt": "json"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": "5",
"catchall": [
"5",
"Test Album 5",
"2011-10-21 00:00:00.0",
"[B#261ca3cb",
"Test Track 1",
"Ya man 2",
"2011-10-17 16:21:29.0",
"1",
"1450412569164513280"
],
"albumTitle": "Test Album 5",
"albumRemovedDate": "2011-10-21T00:00:00Z",
"uuid": "6oT/MMl+RDaPyKpGK1KN0w==",
"trackTitle": [
"Test Track 1",
"Ya man 2"
],
"trackRemovedDate": "2011-10-17T16:21:29Z",
"albumSalesAreaId": [
1
],
"_version_": 1450412569164513300
}
]
}
}
SQL result for id:5
trackTitle and albumSalesAreaId seem to be correct, but I'm not sure why the others are not included. However, if I hard-code the albumSalesAreaNames entity with from SalesArea where id = 1, then the albumSalesArea field is added to the result. So it seems that from SalesArea where id = '${albumSalesAreaIds.salesAreaId}' is returning null, which the 'IN' test earlier also confirmed.
This looks like a problem that is solved simply with multivalued fields.
If you use multivalued fields in this structure, you get one document with ID=10: the duplicated values appear only once, and all the other fields are multivalued. For example, the NAME field will contain 4 different countries, and so will country_code.
Have a look at this article on how to structure your DataImportHandler config to achieve this:
http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example
Basically, you need one query for each multivalued field:
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document name="products">
<entity name="item" query="select * from item">
<field column="ID" name="id" />
<field column="code" name="code" />
<entity name="countryName" query="select name from countrytable where item_id='${item.ID}'">
<field column="name" name="countryName" />
</entity>
<entity name="countryCode" query="select countryCode from countrytable where item_id='${item.ID}'">
<field column="countryCode" name="countryCode" />
</entity>
</entity>
</document>
</dataConfig>
(Posted on behalf of the OP).
SOLUTION
<entity name="albumSalesAreaNames" query="select name from SalesArea where id = '${albumSalesAreaIds.salesAreaId}'">
<field column="name" name="albumSalesArea"/>
</entity>
<field column="salesAreaId" name="albumSalesAreaId"/>
</entity>
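The fragment above appears to nest albumSalesAreaNames inside another entity. A plausible reading, sketched here under the assumption that the parent is the albumSalesAreaIds entity from Update 2, is that a DIH variable such as ${albumSalesAreaIds.salesAreaId} only resolves inside entities nested under albumSalesAreaIds; a sibling entity sees it as empty, which would explain the null result described earlier.

```xml
<entity name="albums" query="select * from Album">
  <!-- Parent entity: yields one row per sales area of the current album -->
  <entity name="albumSalesAreaIds"
          query="select salesAreaId from AlbumSalesArea where albumId = '${albums.id}'">
    <field column="salesAreaId" name="albumSalesAreaId"/>
    <!-- Child entity: runs once per parent row, so the variable is in scope -->
    <entity name="albumSalesAreaNames"
            query="select name from SalesArea where id = '${albumSalesAreaIds.salesAreaId}'">
      <field column="name" name="albumSalesArea"/>
    </entity>
  </entity>
</entity>
```

The country entities from Update 2 would presumably need the same treatment, each nested under the entity whose column it references.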

solr dataimport not working for URLdatasource

This is my data-config.xml
<dataConfig>
<dataSource name="a" type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
<document name="products">
<entity name="images" dataSource="a"
url="file:///abc/1299.xml"
processor="XPathEntityProcessor"
forEach="/imagesList/image"
>
<field column="id" xpath="/imageList/image/productId" />
<field column="image_array" xpath="/imageList/image/imageUrlString" />
</entity>
</document>
</dataConfig>
This is the schema.xml
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="image_array" type="text" indexed="true" stored="true" multivalued="true"/>
But when I run a delta-import, none of the documents get added.
Any help will be highly appreciated.
Well, first off, your field XPaths say imageList, while your forEach (and your XML) says imagesList ...
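A sketch of the entity with the two paths made consistent, assuming the root element of the XML file really is imagesList (if it is imageList, change forEach instead). The attributes are the ones from the question; only the xpath values differ.

```xml
<entity name="images" dataSource="a"
        url="file:///abc/1299.xml"
        processor="XPathEntityProcessor"
        forEach="/imagesList/image">
  <!-- Each xpath repeats the forEach prefix, so it matches inside the same records -->
  <field column="id" xpath="/imagesList/image/productId" />
  <field column="image_array" xpath="/imagesList/image/imageUrlString" />
</entity>
```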

The simplest Solr DIH indexing

I'm trying to index data from a database in Solr using the DIH.
So I have modified the two config files as follows:
solrconfig.xml :
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
data-config.xml :
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="****"/>
<document>
<entity name="source_scellee" query="select * from source_scellee">
</entity>
</document>
</dataConfig>
source_scellee being the name of my table on my test database. It contains many fields.
Obviously, I'm trying to run nothing more than a simple test. When running http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true I get the following result:
<str name="Full Dump Started">2012-01-27 12:27:01</str><str name="">Indexing completed. Added/Updated: 4 documents. Deleted 0 documents.</str><str name="Committed">2012-01-27 12:27:02</str>
<str name="**Total Documents Failed**">4</str>
Besides, there is no warning or error in the server logs. 4 is the number of records in the table "source_scellee", but it says all documents failed.
If I run a query from http://localhost:8983/solr/admin/
no results appear at all! How can I solve this?
("*:*" shows no results)
Thank you for your help!!!
----edit---
I have added these lines to my schema.xml :
<field name="ID" type="int" indexed="true" stored="true" />
<field name="reference_catalogue" type="string" indexed="true" stored="true"/>
<field name="reference_capsule" type="string" indexed="true" stored="true"/>
<field name="organisme_certificateur" type="string" indexed="true" stored="true" />
<field name="reference_certificat" type="string" indexed="true" stored="true" />
<field name="duree_d_utilisation" type="string" indexed="true" stored="true" />
<field name="activite_nominale" type="string" indexed="true" stored="true"/>
<field name="activite_minimale" type="string" indexed="true" stored="true"/>
<field name="activite_maximale" type="string" indexed="true" stored="true"/>
<field name="coffret" type="boolean" indexed="true" stored="true"/>
<field name="dispositif_medical" type="boolean" indexed="true" stored="true"/>
<field name="forme_speciale" type="boolean" indexed="true" stored="true" />
<field name="exemption_cpa" type="boolean" indexed="true" stored="true"/>
<field name="marquage_ce" type="boolean" indexed="true" stored="true"/>
<field name="element_cible" type="boolean" indexed="true" stored="true"/>
However, the result is still the same: no results when querying (I also tried restarting Solr and re-indexing everything).
------second edit---
I have tried the dynamic field approach.
Now my data-config.xml looks like this:
<document>
<entity name="source_scellee" query="select * from source_scellee">
<field column="ID" name="ID_i" />
<field column="reference_catalogue" name="reference_catalogue_s" />
<field column="reference_capsule" name="reference_capsule_s" />
<field column="organisme_certificateur" name="organisme_certificateur_s" />
<field column="reference_certificat" name="reference_certificat_s" />
<field column="duree_d_utilisation" name="duree_d_utilisation_s" />
<field column="activite_nominale" name="activite_nominale_s" />
<field column="activite_minimale" name="activite_minimale_s" />
<field column="activite_maximale" name="activite_maximale_s" />
<field column="coffret" name="coffret_b" />
<field column="dispositif_medical" name="dispositif_medical_b" />
<field column="forme_speciale" name="forme_speciale_b" />
<field column="exemption_cpa" name="exemption_cpa_b" />
<field column="marquage_ce" name="marquage_ce_b" />
<field column="element_cible" name="element_cible_b" />
</entity>
</document>
1.) You can take a look at the statistics page to see how many docs are indexed right now:
http://localhost:8983/solr/admin/stats.jsp
2.) The result of your search depends on your schema.xml, because it defines how docs are indexed/stored, which fields are processed, and how searches are handled at query time.
Please take a look at this file, or post the field definitions from schema.xml and also the schema/design of your table source_scellee.
Do the columns and the fields have the same names?
//Edit: This should work if the column names and field names are the same:
<document>
<entity name="source_scellee"
pk="ID"
query="select * from source_scellee">
</entity>
</document>
Is having NULL values in the data an issue?
That depends on the destination field.
Are you running Solr in Tomcat or something like that?
Take a look at the Java EE container output, such as catalina.out.
I am pretty sure the issue lies in how the DIH is trying to map fields. Thanks for adding the information from your schema file... However, I believe the field configuration needs to be added separately to both the schema.xml and the data-config.xml for the DIH, and you have added it to only one of them.
Based on the Full Import Example from the Solr Wiki, I would try the following.
schema.xml
<field name="ID" type="int" indexed="true" stored="true" />
<field name="reference_catalogue" type="string" indexed="true" stored="true"/>
<field name="reference_capsule" type="string" indexed="true" stored="true"/>
<field name="date_de_creation" type="date" indexed="true" stored="true"/>
<field name="organisme_certificateur" type="string" indexed="true" stored="true" />
<field name="reference_certificat" type="string" indexed="true" stored="true" />
<field name="duree_d_utilisation" type="string" indexed="true" stored="true" />
<field name="activite_nominale" type="string" indexed="true" stored="true"/>
<field name="activite_minimale" type="string" indexed="true" stored="true"/>
<field name="activite_maximale" type="string" indexed="true" stored="true"/>
<field name="coffret" type="int" indexed="true" stored="true"/>
<field name="dispositif_medical" type="int" indexed="true" stored="true"/>
<field name="forme_speciale" type="int" indexed="true" stored="true" />
<field name="exemption_cpa" type="int" indexed="true" stored="true"/>
<field name="marquage_ce" type="int" indexed="true" stored="true"/>
<field name="element_cible" type="int" indexed="true" stored="true"/>
data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="****"/>
<document>
<entity name="source_scellee" query="select * from source_scellee">
<field column="ID" name="ID"/>
<field column="reference_catalogue" name="reference_catalogue"/>
<field column="reference_capsule" name="reference_capsule"/>
<field column="date_de_creation" name="date_de_creation"/>
<field column="organisme_certificateur" name="organisme_certificateur"/>
<field column="reference_certificat" name="reference_certificat"/>
<field column="duree_d_utilisation" name="duree_d_utilisation"/>
<field column="activite_nominale" name="activite_nominale"/>
<field column="activite_minimale" name="activite_minimale"/>
<field column="activite_maximale" name="activite_maximale"/>
<field column="coffret" name="coffret"/>
<field column="dispositif_medical" name="dispositif_medical"/>
<field column="forme_speciale" name="forme_speciale"/>
<field column="exemption_cpa" name="exemption_cpa"/>
<field column="marquage_ce" name="marquage_ce"/>
<field column="element_cible" name="element_cible"/>
</entity>
</document>
</dataConfig>
There is also a way to set up the schema.xml so that it dynamically accepts fields it encounters, based on naming conventions. Please see the Dynamic Fields section of the Solr Wiki for details and some examples of how this can be done.
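For the suffixes used in the second edit of the question (_i, _s, _b), the dynamic field declarations could look like the sketch below. The type names int, string and boolean are assumed to be defined in your schema.xml, as they are in the field definitions above.

```xml
<!-- schema.xml (fragment): any field whose name matches a pattern is accepted -->
<dynamicField name="*_i" type="int"     indexed="true" stored="true"/>
<dynamicField name="*_s" type="string"  indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
```

With rules like these in place, a column mapped to ID_i or coffret_b can be indexed without declaring each field individually.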

Create index on two unrelated table in Solr

I want to create an index over two tables, stock and auction. I am working on a product site, so I have to index both tables, and they are not related at all.
In the data-config.xml that I created for indexing, I wrote the following:
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/database" user="root" password=""/>
<document name="content">
<entity name="stock" query="select ST_StockID,ST_StockCode,ST_Name,ST_ItemDetail from stock where estatus = 'Active' limit 100">
<field column="ST_StockID" name="stock_ST_StockID" />
<field column="ST_StockCode" name="stock_ST_StockCode" />
<field column="ST_Name" name="stock_ST_Name" />
<field column="ST_ItemDetail" name="stock_ST_ItemDetail" />
<entity name="auction" query="select iauctionid,rad_number,vsku,auction_code from auction limit 100">
<field column="iauctionid" name="auction_iauctionid" />
<field column="rad_number" name="auction_rad_number" />
<field column="vsku" name="auction_vsku" />
<field column="auction_code" name="auction_auction_code" />
</entity>
</entity>
</document>
</dataConfig>
and the schema.xml contains the fields given below.
<field name="stock_ST_StockID" type="string" indexed="true" stored="true" required="true"/>
<field name="stock_ST_StockCode" type="string" indexed="true" stored="true" required="true"/>
<field name="stock_ST_Name" type="string" indexed="true" stored="true" required="true"/>
<field name="stock_ST_ItemDetail" type="text" indexed="true" stored="true" required="true"/>
<field name="auction_iauctionid" type="string" indexed="true" stored="true" required="true"/>
<field name="auction_rad_number" type="string" indexed="true" stored="true" required="true"/>
<field name="auction_vsku" type="string" indexed="true" stored="true" required="true"/>
<field name="auction_auction_code" type="text" indexed="true" stored="true" required="true"/>
But this way the index is built incorrectly, because I nested the second table's data inside the first table's entity in data-config.xml. If I create two separate entity elements as shown below, the index is not created at all.
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/lc" user="root" password=""/>
<document name="content">
<entity name="stock" query="select ST_StockID,ST_StockCode,ST_Name,ST_ItemDetail from stock where estatus = 'Active' limit 100">
<field column="ST_StockID" name="stock_ST_StockID" />
<field column="ST_StockCode" name="stock_ST_StockCode" />
<field column="ST_Name" name="stock_ST_Name" />
<field column="ST_ItemDetail" name="stock_ST_ItemDetail" />
</entity>
<entity name="auction" query="select iauctionid,rad_number,vsku,auction_code from auction limit 100">
<field column="iauctionid" name="auction_iauctionid" />
<field column="rad_number" name="auction_rad_number" />
<field column="vsku" name="auction_vsku" />
<field column="auction_code" name="auction_auction_code" />
</entity>
</document>
</dataConfig>
I did not understand your answer; can you please elaborate a little more? I have the same requirement. I have two tables, stock and auction. I am working on a product site, so I have to index both tables, and they are not related at all.
Please help.
Do you get any errors when indexing the data?
The following data config is fine as you have two unrelated items.
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/lc" user="root" password=""/>
<document name="content">
<entity name="stock" query="select ST_StockID,ST_StockCode,ST_Name,ST_ItemDetail from stock where estatus = 'Active' limit 100">
<field column="ST_StockID" name="stock_ST_StockID" />
<field column="ST_StockCode" name="stock_ST_StockCode" />
<field column="ST_Name" name="stock_ST_Name" />
<field column="ST_ItemDetail" name="stock_ST_ItemDetail" />
</entity>
<entity name="auction" query="select iauctionid,rad_number,vsku,auction_code from auction limit 100">
<field column="iauctionid" name="auction_iauctionid" />
<field column="rad_number" name="auction_rad_number" />
<field column="vsku" name="auction_vsku" />
<field column="auction_code" name="auction_auction_code" />
</entity>
</document>
</dataConfig>
However, a few things are missing.
What is the id field for each entity? Each document should have a unique id, and the configuration above seems to be missing one.
Also, the id should be unique across the entities; otherwise the stock and auction documents would overwrite each other.
So you may want to prefix the ids with stock_ and auction_.
You can also add a static field with the values Stock and Auction to your schema and populate it, which would help you filter the results when searching and hence improve performance.
For assigning the ids:
You can use the following to create the id value. It prefixes the ST_StockID field value with Stock_ (the entity needs transformer="TemplateTransformer" for the template attribute to take effect).
<field column="id" template="Stock_${stock.ST_StockID}" />
OR
Use an alias in SQL, e.g. SELECT CONCAT('Stock_', ST_StockID) AS id ..... (in MySQL, || is a logical OR by default, so use CONCAT rather than ||), and then use:
<field column="id" name="id" />
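Putting these suggestions together, a sketch of the two-entity config might look like this. It assumes the schema has a uniqueKey field named id and an extra string field, here called doc_type (a hypothetical name), and that TemplateTransformer is enabled on each entity. Note also that the schema in the question marks every stock_* and auction_* field as required="true"; since a stock document carries no auction fields and vice versa, those required flags would likely have to be dropped before either kind of document is accepted.

```xml
<document name="content">
  <entity name="stock" transformer="TemplateTransformer"
          query="select ST_StockID,ST_StockCode,ST_Name,ST_ItemDetail from stock where estatus = 'Active' limit 100">
    <!-- Prefix keeps stock ids from colliding with auction ids -->
    <field column="id" template="stock_${stock.ST_StockID}" />
    <!-- Static value usable as a filter; doc_type is a hypothetical field name -->
    <field column="doc_type" template="stock" />
    <field column="ST_StockID" name="stock_ST_StockID" />
    <field column="ST_StockCode" name="stock_ST_StockCode" />
    <field column="ST_Name" name="stock_ST_Name" />
    <field column="ST_ItemDetail" name="stock_ST_ItemDetail" />
  </entity>
  <entity name="auction" transformer="TemplateTransformer"
          query="select iauctionid,rad_number,vsku,auction_code from auction limit 100">
    <field column="id" template="auction_${auction.iauctionid}" />
    <field column="doc_type" template="auction" />
    <field column="iauctionid" name="auction_iauctionid" />
    <field column="rad_number" name="auction_rad_number" />
    <field column="vsku" name="auction_vsku" />
    <field column="auction_code" name="auction_auction_code" />
  </entity>
</document>
```

A search restricted to one table could then add a filter query such as fq=doc_type:stock.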
