Append to a Solr Index - solr

This may be a trivial question but I am trying to append to an existing Solr index and seem to be overwriting what is there every time. I have two databases that I am getting data from and I can import data from each database individually but when I import data from one then immediately import data from the second one, the first is overwritten. I have two dataSources mapped in my db-config.xml file and I am using the standard Admin UI to run the import. My config file looks like this.
<dataConfig>
  <dataSource
      name="ds-1"
      type="JdbcDataSource"
      driver="Driver"
      url="jdbc_url1"
      user="user1"
      password="pass1"/>
  <dataSource
      name="ds-2"
      type="JdbcDataSource"
      driver="Driver"
      url="jdbc_url2"
      user="user2"
      password="pass2"/>
  <document>
    <entity name="entity1" dataSource="ds-1" query="SELECT YYY FROM TABLE">
      ...
    </entity>
    <entity name="entity2" dataSource="ds-2" query="SELECT ZZZ FROM TABLE">
      ...
    </entity>
  </document>
</dataConfig>
What can I do to prevent the original index from being overwritten? I want to incrementally add data from a variety of different sources all the time, so having my indexes wiped does me no good.

Your issue is probably that you are using the primary key id from the database as the key for your indexed documents, and the values from the two databases overlap. To prevent this, you need to give Solr an id that is unique across all sources. When I have run into this issue in the past, I have used a string field as the id field and prepended a character or two to the id from the database to make it unique. Example: items from the Product table would have ids like P1, P2, etc., and items from the Orders table would have ids like O1, O2, etc.
You should be able to use the Data Import Handler TemplateTransformer to help accomplish this for you.
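As a rough sketch of that approach, the fragment below prefixes each entity's database key with a source-specific letter. The table names, the PRODUCT_ID and ORDER_ID columns, and the P/O prefixes are assumptions for illustration, and the Solr id field is assumed to be a string type.

```xml
<document>
  <!-- Rows from ds-1 get ids such as P1, P2 (PRODUCT_ID is a hypothetical column) -->
  <entity name="entity1" dataSource="ds-1" transformer="TemplateTransformer"
          query="SELECT PRODUCT_ID FROM PRODUCTS">
    <field column="id" template="P${entity1.PRODUCT_ID}"/>
  </entity>
  <!-- Rows from ds-2 get ids such as O1, O2 (ORDER_ID is a hypothetical column) -->
  <entity name="entity2" dataSource="ds-2" transformer="TemplateTransformer"
          query="SELECT ORDER_ID FROM ORDERS">
    <field column="id" template="O${entity2.ORDER_ID}"/>
  </entity>
</document>
```

With ids built this way, a row with key 1 in each database ends up as two separate Solr documents instead of one replacing the other.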

Related

Dynamic TableName SOLR data import handler

I'm looking to configure Solr to query a table based on certain data.
Unfortunately I have to work with how the database is set up, but here's what I'm after.
I have a table named Company that will contain a certain "prefix" value.
I want to use that prefix value to determine what tables I should query for the DIH.
As a quick sample:
<entity name="company" query="Select top 1 prefix from Company">
  <field name="prefix" column="prefix"/>
  <entity name="item" query="select * from ${company.prefix}item">
    <field column="ItemID" name="id"/>
    <field column="Description" name="description"/>
  </entity>
</entity>
However I only ever seem to get 1 document processed despite that table containing over 200,000 rows.
What am I doing wrong?
I think you could achieve this by:
using a stored procedure. You can call a stored procedure from DIH, as seen here
inside the stored procedure, you can do the table lookup as needed, and then return the results from the real query.
Depending on how comfortable you are with SQL Server's SQL dialect, you might be able to put everything into a single SQL query and use that directly in DIH, but I am not sure about that.
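A minimal sketch of the stored-procedure route, assuming SQL Server and a hypothetical procedure named dbo.GetCompanyItems that looks up the prefix, builds the table name, and returns the item rows:

```xml
<document>
  <!-- dbo.GetCompanyItems is a hypothetical procedure; it is assumed to
       return the ItemID and Description columns of the prefixed item table -->
  <entity name="item" query="EXEC dbo.GetCompanyItems">
    <field column="ItemID" name="id"/>
    <field column="Description" name="description"/>
  </entity>
</document>
```

This layout also makes item the root entity. DIH creates one Solr document per row of the root entity, and nested entities only add fields to that document, which would explain why the original config, whose root entity returns a single company row, produced a single document.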

Foreign key references in Solr dataImportHandler

I've just started using Solr. In my database I have a collection of folders containing two kinds of entities; let's call them barrels and monkeys. Folders contain barrels and barrels contain monkeys. Users should be able to search for barrels and monkeys, but they are only allowed to see certain folders, and the search should not return barrels or monkeys in folders they are not allowed to see. I have a filter query which does this fine for the barrels, but I'm having trouble getting the data import handler to import the folder ids for the monkeys. My data-config file looks like this:
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/myDB" user="myUser" password="pass"/>
  <document name="item">
    <entity name="barrels" query="select * from barrels where is_deleted=0" transformer="TemplateTransformer"
            deltaQuery="select barrel_id from barrels where last_modified > '${dataimporter.last_index_time}'">
      <field column="itemType" template="barrels" name="itemType"/>
      <field column="barrel_id" name="id" pk="true" template="barrel-${barrels.barrel_id}"/>
      <!--Other fields-->
      <field column="folder_id" name="folder_id"/>
    </entity>
    <entity name="monkeys" query="select * from monkeys where is_deleted=0" transformer="TemplateTransformer"
            deltaQuery="select monkey_id from monkeys where last_modified > '${dataimporter.last_index_time}'">
      <field column="itemType" template="monkeys" name="itemType"/>
      <field column="monkey_id" name="id" pk="true" template="monkey-${monkeys.monkey_id}"/>
      <field column="barrel_id" name="barrel_id"/>
      <!--Other fields-->
      <entity name="barrels"
              query="select folder_id from barrels where barrel_id='${monkeys.barrel_id}'">
        <field name="folder_id" column="folder_id" />
      </entity>
    </entity>
  </document>
</dataConfig>
When I change the '${monkeys.barrel_id}' in the foreign key query to 28, it works, but when I try and get it to use the correct id, it doesn't import anything.
Can anyone spot what I'm doing wrong, or tell me a good way to debug this kind of thing? E.g. how can I get it to tell me what value it has for '${monkeys.barrel_id}' ? All the relevant fields are defined in schema.xml. Since having this problem I've made sure the documents all have the same names as the tables, and tried changing various bits of query to upper case, but everything's in lower case in the database and it doesn't seem to help.
Having asked the question, I did manage to figure it out eventually. Here's what I learnt:
1) Getting it to tell you the query is very useful, and it is just a matter of setting the logging level to fine. You have to set it to fine in all the relevant places though. So for my Standalone.xml (in WildFly), in addition to the
<logger category="org.apache.solr">
  <level name="FINE"/>
</logger>
bit, I needed to set the file logger and another logging bit to fine. Really should have realised that earlier...
2) The single quotes are not part of the expression evaluation syntax, they are just quotes. So you don't need them when dealing with ints. I guess the example that comes with solr uses string ids rather than int ids and that's why it has the quotes?
3) Once I'd got rid of the quotes, changing the case did make a difference. For my database its preferred case was Barrel_ID for some reason. I hadn't tried it much with capitals at both ends but not in the middle, but that's what worked. So I guess the moral of the story is that it is worthwhile to try lots of different cases even if they seem silly.
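Putting points 2 and 3 together, the nested entity would look roughly like the sketch below. The Barrel_ID casing is what worked for this particular database; other databases or drivers may report the column under a different case, so treat it as an assumption.

```xml
<!-- Inner entity of "monkeys": no quotes around the integer key,
     and the column referenced in the case this database reported -->
<entity name="barrels"
        query="select folder_id from barrels where barrel_id=${monkeys.Barrel_ID}">
  <field name="folder_id" column="folder_id"/>
</entity>
```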

Solr nested data import

I have a master/detail table that I would like to import in Solr so I can query it.
Now it appears to me that only the first row of the detail table is imported.
How do I import all rows from the detail table?
I currently have something like this in my data import handler query:
<entity name="master" query="SELECT id, name, description
                             FROM master WHERE isapproved = 1">
  <!-- snip -->
  <entity name="details" query="SELECT sku,description,price
                                FROM details WHERE masterid='${master.id}'">
    <field column="sku" name="sku" />
  </entity>
</entity>
To make it a bit more difficult, sometimes there are only master rows without corresponding detail rows. So I could not reverse the query (select detail first and then master) because that would leave me without the master data.
What is a good solution?
Unfortunately I cannot see your schema.xml, but it is likely that you forgot to mark the field as multiValued="true" there. In that case Solr would keep only the first value and skip the rest.
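For illustration, a multi-valued definition for the sku field from the question could look like this in schema.xml. The string type and the indexed/stored flags are assumptions, since the actual schema is not shown.

```xml
<!-- schema.xml: "string" and the indexed/stored flags are assumed values -->
<field name="sku" type="string" indexed="true" stored="true" multiValued="true"/>
```

Any other detail columns mapped into the master document (description, price) would need the same attribute to hold one value per detail row.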

In Solr, is it possible to add values for document during indexing, based on a certain field value and a lookup?

Given a text file of unique values, is there an analyzer configuration possible that would use a field of a document that is to be indexed and look it up in the text file, and when found, add a value to another field?
Scenario: products with a unique ID are being indexed, if a product's ID is found in special.txt, then the field 'special' is set to true.
This is for adding occasional information to an index from a manually maintained external data source.
Nope, but you can try a few options:
Create a new filter/analyzer and use it with a copyField whose source is the product ID. Load the file and, if a match is found, add special as the token in the copyField destination.
Use a synonyms file that maps each listed ID to special, so that a matching ID is indexed with special as its content.
If you are using DIH, look at ScriptTransformer, which lets you check a value and add a new field.
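You can use a transformer in your data config:
<dataConfig>
  <script><![CDATA[
    // Called once per row. This only tests for a non-null product ID;
    // a lookup against the external list would replace this condition.
    function checkProductID(row) {
      if (row.get('ProductID') !== null) {
        row.put('special', 1);
      }
      return row;
    }
  ]]></script>
  <document>
    <!-- ScriptTransformer functions are referenced with the "script:" prefix -->
    <entity name="e" pk="id" transformer="script:checkProductID">
      ....
    </entity>
  </document>
</dataConfig>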
The new field (special) must also be defined in schema.xml.

Solr: How distinguish between multiple entities imported through DIH

When using DataImportHandler with SqlEntityProcessor, I want to have several entity definitions with different queries going into the same schema.
How can I search both types of entities and still distinguish their source? Example:
<document>
  <entity name="entity1" query="query1">
    <field column="column1" name="column1" />
    <field column="column2" name="column2" />
  </entity>
  <entity name="entity2" query="query2">
    <field column="column1" name="column1" />
    <field column="column2" name="column2" />
  </entity>
</document>
How do I get data from entity 1 and from entity 2?
As long as your schema fields (e.g. column1, column2) are compatible between different entities, you can just run DataImportHandler and it will populate Solr collection from both queries.
Then, when you query, you will see all entities combined.
If you want to mark which entity came from which source, I would recommend adding another field (e.g. type) and assigning to it different static values in each entity definition using TemplateTransformer.
Also beware of the clean option. By default it deletes everything from the index. As you are populating the index from several sources, you need to make sure it does not delete too much. Use preImportDeleteQuery to delete only the entries whose type field has the value you set for that entity.
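A sketch of both suggestions combined, assuming a field named type exists in the schema and that the entity names double as the static type values:

```xml
<document>
  <!-- Each entity stamps its rows with a static type and, on a clean import,
       deletes only the documents carrying that same type -->
  <entity name="entity1" query="query1" transformer="TemplateTransformer"
          preImportDeleteQuery="type:entity1">
    <field column="type" template="entity1"/>
    <field column="column1" name="column1"/>
    <field column="column2" name="column2"/>
  </entity>
  <entity name="entity2" query="query2" transformer="TemplateTransformer"
          preImportDeleteQuery="type:entity2">
    <field column="type" template="entity2"/>
    <field column="column1" name="column1"/>
    <field column="column2" name="column2"/>
  </entity>
</document>
```

At query time, a filter such as fq=type:entity1 would restrict results to one source, while omitting the filter returns documents from both.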
