Efficiency aspect of delta import in Solr

I have a table of about 2,100,000 rows. A full-import takes about 2 minutes. To index updates to the table I'm using delta-import, which takes about 6 minutes.
From an efficiency standpoint it therefore seems better to do a full-import rather than a delta-import. So what is the point of delta-import? Is there a better way to use delta-import that would improve its efficiency?
I followed the steps in the documentation.
data-config.xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.dbschema.CassandraJdbcDriver"
              url="jdbc:cassandra://127.0.0.1:9042/test" autoCommit="true"
              rowLimit="-1" batchSize="-1"/>
  <document name="content">
    <entity name="test"
            query="SELECT * from person"
            deltaImportQuery="select * from person where seq=${dataimporter.delta.seq}"
            deltaQuery="select seq from person where last_modified > '${dataimporter.last_index_time}' ALLOW FILTERING"
            autoCommit="true">
      <field column="seq" name="id" />
      <field column="last" name="last_s" />
      <field column="first" name="first_s" />
      <field column="city" name="city_s" />
      <field column="zip" name="zip_s" />
      <field column="street" name="street_s" />
      <field column="age" name="age_s" />
      <field column="state" name="state_s" />
      <field column="dollar" name="dollar_s" />
      <field column="pick" name="pick_s" />
    </entity>
  </document>
</dataConfig>

The usual way of setting up delta indexing (as you did) runs two queries instead of a single one, so in some cases it is not optimal.
I prefer to set up deltas so that there is a single query to maintain: it's cleaner, and the delta runs in one query. You should try it; it may improve things. The downside is deletes: you either do some soft-deleting, or you still need the usual delta configuration for those (I favour the former).
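The single-query setup folds the timestamp check into the main query and relies on the `clean` request parameter to distinguish a full rebuild from a delta run. A sketch adapted to the question's config (untested against the Cassandra JDBC driver; `ALLOW FILTERING` may still be required there):

```xml
<entity name="test"
        query="SELECT * from person
               WHERE '${dataimporter.request.clean}' != 'false'
               OR last_modified > '${dataimporter.last_index_time}'">
  <field column="seq" name="id" />
  <!-- remaining field mappings as in the original config -->
</entity>
```

A full rebuild is then `/dataimport?command=full-import`, and a delta run is `/dataimport?command=full-import&clean=false`: with `clean=false` the first condition is false, so only rows modified since the last index time are selected.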
Also, of course, make sure the last_modified column is properly indexed. I am not familiar with the Cassandra JDBC driver, so you should double-check.
Lastly, if you are using DataStax Enterprise Edition, you can query Cassandra via Solr if it is configured for that. In that case you could also try indexing off a SolrEntityProcessor, and with a request-parameter trick you can do both full and delta indexing. I used that successfully in the past.

Related

SolR 5 : how can I index multiple databases in one core

I'm currently trying to index multiple databases (two MySQL and one PostgreSQL) into one and the same index, so as to provide search on a website.
I've succeeded in importing each MySQL database separately into a different core (with different indexes).
Edit: The problem is that I have an id for each table, and these conflict with each other. How can I tell Solr that each database has a different ID?
Code:
<entity name="database1" dataSource="ds-database1" query="SELECT id, my_column FROM table_database1">
  <field column="id" name="id" />
  <field column="my_column" name="ts_my_column" />
</entity>
<entity name="database2" dataSource="ds-database2"
        query="SELECT id, column_example FROM table_database2">
  <field column="id" name="id" />
  <field column="column_example" name="ts_columnsexample" />
</entity>
You can use a TemplateTransformer to prefix a value when using DIH, or you can do it in your SQL:
SELECT CONCAT('db_1_', id) AS id ...
... or you can do it with a ScriptTransformer if you need even more logic around the transformation.
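A minimal sketch of the TemplateTransformer approach, reusing the first entity from the question (the `db_1_` prefix is just an example value):

```xml
<entity name="database1" dataSource="ds-database1" transformer="TemplateTransformer"
        query="SELECT id, my_column FROM table_database1">
  <!-- Prefix the database id so documents from different sources cannot collide -->
  <field column="id" name="id" template="db_1_${database1.id}" />
  <field column="my_column" name="ts_my_column" />
</entity>
```

The same pattern with a different prefix (e.g. `db_2_`) on the other entities keeps every uniqueKey distinct across the three databases.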

Foreign key references in Solr dataImportHandler

I've just started using Solr. In my database I have a collection of folders containing two kinds of entities; let's call them barrels and monkeys. Folders contain barrels, and barrels contain monkeys. Users should be able to search for barrels and monkeys, but they are only allowed to see certain folders, and the search must not return barrels or monkeys in folders they are not allowed to see. I have a filter query which does this fine for the barrels, but I'm having trouble getting the data import handler to import the folder ids for the monkeys. My data-config file looks like this:
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/myDB" user="myUser" password="pass"/>
  <document name="item">
    <entity name="barrels" query="select * from barrels where is_deleted=0"
            transformer="TemplateTransformer"
            deltaQuery="select barrel_id from barrels where last_modified > '${dataimporter.last_index_time}'">
      <field column="itemType" template="barrels" name="itemType"/>
      <field column="barrel_id" name="id" pk="true" template="barrel-${barrels.barrel_id}"/>
      <!-- Other fields -->
      <field column="folder_id" name="folder_id"/>
    </entity>
    <entity name="monkeys" query="select * from monkeys where is_deleted=0"
            transformer="TemplateTransformer"
            deltaQuery="select monkey_id from monkeys where last_modified > '${dataimporter.last_index_time}'">
      <field column="itemType" template="monkeys" name="itemType"/>
      <field column="monkey_id" name="id" pk="true" template="monkey-${monkeys.monkey_id}"/>
      <field column="barrel_id" name="barrel_id"/>
      <!-- Other fields -->
      <entity name="barrels"
              query="select folder_id from barrels where barrel_id='${monkeys.barrel_id}'">
        <field name="folder_id" column="folder_id" />
      </entity>
    </entity>
  </document>
</dataConfig>
When I change the '${monkeys.barrel_id}' in the foreign-key query to 28, it works, but when I try to get it to use the correct id, it doesn't import anything.
Can anyone spot what I'm doing wrong, or suggest a good way to debug this kind of thing? E.g. how can I get it to tell me what value it has for '${monkeys.barrel_id}'? All the relevant fields are defined in schema.xml. Since having this problem I've made sure the entities all have the same names as the tables, and tried changing various bits of the query to upper case, but everything's in lower case in the database and it doesn't seem to help.
Having asked the question, I did manage to figure it out eventually. Here's what I learnt:
1) Getting it to show you the query is very useful, and it is just a matter of setting the logging level to FINE. You have to set it to FINE in all the relevant places, though. So for my standalone.xml (in WildFly), in addition to the
<logger category="org.apache.solr">
  <level name="FINE"/>
</logger>
bit, I also needed to set the file handler and another logging section to FINE. I really should have realised that earlier...
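For reference, a WildFly standalone.xml file handler raised to FINE looks roughly like this (the handler name FILE and file path are the WildFly defaults, assumed here):

```xml
<periodic-rotating-file-handler name="FILE" autoflush="true">
  <!-- The handler filters messages too, so it must also allow FINE -->
  <level name="FINE"/>
  <file relative-to="jboss.server.log.dir" path="server.log"/>
  <suffix value=".yyyy-MM-dd"/>
</periodic-rotating-file-handler>
```

Both the logger category and every handler in its chain have to permit FINE, or the messages are dropped before they reach the log file.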
2) The single quotes are not part of the expression-evaluation syntax; they are just quotes. So you don't need them when dealing with ints. I guess the example that comes with Solr uses string ids rather than int ids, and that's why it has the quotes.
3) Once I'd got rid of the quotes, changing the case did make a difference. For my database the preferred case was Barrel_ID, for some reason. I hadn't tried capitals at both ends but not in the middle, yet that's what worked. So the moral of the story is that it's worthwhile to try lots of different cases, even ones that seem silly.
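Putting points 2 and 3 together, the sub-entity query that finally worked would look something like this (a reconstruction from the description above, not the author's verbatim config):

```xml
<!-- No quotes around the int placeholder, and the column case the driver prefers -->
<entity name="barrels"
        query="select folder_id from barrels where Barrel_ID=${monkeys.barrel_id}">
  <field name="folder_id" column="folder_id" />
</entity>
```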

Solr: split category data and product data over different cores/instances?

I have a webshop with multiple different product categories.
For each category I have a description, metadata, an image and some more category-specific data.
Right now, my data-config.xml looks as below.
However, I think this way I'm indexing all category-specific data for each product individually, taking up much more space than needed.
I'm now considering moving the indexing and storing of category-specific data to a separate Solr core/instance; that way I would basically have separated the product-specific data from the category data.
Is this reasoning correct? Is it better to move the category-specific data out of this core/instance?
<document name="shopitems">
  <entity name="shopitem" pk="id" query="select * from products">
    <field name="id" column="ID" />
    <field name="articlenr" column="articlenr" />
    <field name="title" column="title" />
    <entity name="catdescription"
            query="select pagetitle_de as cat_pagetitle_de, pagetitle_en as cat_pagetitle_en,
                   description as cat_description, metadescription as cat_metadescription
                   FROM products_custom_cat_descriptions
                   where articlegroup = '${shopitem.articlegroup}'">
    </entity>
  </entity>
</document>
Generally speaking, your implementation will be easier if you flatten (de-normalize) everything, as you did. If you spin the categories off into a different core, Solr becomes harder to use - you will need extra queries and extra client code, faceting won't work as easily, etc. - all of which results in a performance hit on top of the extra implementation difficulty.
From the numbers you give (are you staying under 1 GB of index size? that is not big), I would definitely not split out the category data; it will make your life harder for little practical gain.
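If import time (rather than index size) ever becomes the concern, note that the nested sub-entity above issues one extra SQL query per product. A flattened single-entity sketch with a JOIN avoids that; table and column names are taken from the question's config:

```xml
<entity name="shopitem" pk="id"
        query="select p.*,
               c.pagetitle_de as cat_pagetitle_de, c.pagetitle_en as cat_pagetitle_en,
               c.description as cat_description, c.metadescription as cat_metadescription
               from products p
               left join products_custom_cat_descriptions c
                 on c.articlegroup = p.articlegroup">
  <field name="id" column="ID" />
  <field name="articlenr" column="articlenr" />
  <field name="title" column="title" />
</entity>
```

The index stays denormalized either way; only the number of round trips to the database changes.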

Can fields be nested in Solr?

I need to have fields nested inside of fields; does Solr provide that ability?
For example: I need a multivalued field called Products, and each Product in turn needs a multivalued field Properties. I need the nesting so that when I search for a property, only the corresponding product's info is returned, not all products.
Currently I find that if a doc has 10 products with 10 properties each, searching for a property returns all the products in the doc that holds that property. I then have to sort out manually which product had that property, by comparing array indices: if property 53 is returned, it is the 6th product. This gets worse when not all products have an equal number of properties.
Is there no easier way?
Thanks in advance for your replies.
Yes, recent Solr supports nested documents, though there are some trade-offs. Mostly, you have to index and delete the whole parent+children block together. That should not be a problem in your case.
After that, you can search them in a couple of different ways using block joins.
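To make the idea concrete, here is a sketch of a parent document with nested children in Solr's XML update format; the field names (doc_type, prop_name, prop_value) are made up for illustration and would need to exist in your schema:

```xml
<add>
  <doc>
    <field name="id">product-1</field>
    <field name="doc_type">product</field>
    <field name="name">Example product</field>
    <!-- Child documents are nested directly inside the parent -->
    <doc>
      <field name="id">product-1-prop-1</field>
      <field name="doc_type">property</field>
      <field name="prop_name">colour</field>
      <field name="prop_value">red</field>
    </doc>
  </doc>
</add>
```

A block join query such as `q={!parent which="doc_type:product"}prop_value:red` then returns only the parent products whose own children match, rather than every product in a flat document.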
Not sure if it is useful in your situation, but this is what I am doing in my data-config.xml:
<document>
  <entity name="paper" query="SELECT * FROM papers">
    <field column="title" name="title"/>
    <field column="title" name="title_unstem"/>
    <field column="year" name="publish_date"/>
    <entity name="person"
            query="SELECT * FROM papers_people PA, people A
                   WHERE PA.person_id = A.id AND PA.paper_id = '${paper.id}'">
      <field column="id" name="author_id"/>
      <field column="first_name" name="first_name"/>
      <field column="last_name" name="last_name"/>
      <field column="full_name" name="author"/>
    </entity>
    <entity name="volume" query="SELECT * FROM volumes WHERE id = '${paper.volume_id}'">
      <field column="id" name="volume_id"/>
      <field column="title" name="volume_title"/>
      <field column="anthology_id" name="volume_anthology"/>
    </entity>
  </entity>
</document>
Basically, as you can see, my Paper has many Authors and belongs to a Volume. I am doing this in Ruby on Rails with the Blacklight gem, so if you have any questions just ask me.
If this is your key requirement and you haven't invested much in Solr yet, then I suggest you look at Elasticsearch: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-nested-type.html
Otherwise, block join is the only out-of-the-box way to do it in Solr, and it looks more like a hack.

Solr: How distinguish between multiple entities imported through DIH

When using the DataImportHandler with SqlEntityProcessor, I want to have several entity definitions going into the same schema with different queries.
How can I search both types of entities but also distinguish their source at the same time? Example:
<document>
  <entity name="entity1" query="query1">
    <field column="column1" name="column1" />
    <field column="column2" name="column2" />
  </entity>
  <entity name="entity2" query="query2">
    <field column="column1" name="column1" />
    <field column="column2" name="column2" />
  </entity>
</document>
How to get data from entity 1 and from entity 2?
As long as your schema fields (e.g. column1, column2) are compatible between the different entities, you can just run the DataImportHandler and it will populate the Solr collection from both queries.
Then, when you query, you will see all entities combined.
If you want to mark which entry came from which source, I recommend adding another field (e.g. type) and assigning it a different static value in each entity definition using TemplateTransformer.
Also beware of the clean command. By default it deletes everything from the index. As you are populating the index from several sources, you need to make sure it does not delete too much. Use preImportDeleteQuery to delete only entries whose type field has the value you set for that entity.
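Both suggestions can be sketched together like this; the field name `type` is an assumption and must be declared in your schema:

```xml
<document>
  <!-- preImportDeleteQuery limits what "clean" removes to this entity's own documents -->
  <entity name="entity1" query="query1" transformer="TemplateTransformer"
          preImportDeleteQuery="type:entity1">
    <!-- Static marker so results can be filtered, e.g. fq=type:entity1 -->
    <field column="type" template="entity1" name="type" />
    <field column="column1" name="column1" />
    <field column="column2" name="column2" />
  </entity>
  <entity name="entity2" query="query2" transformer="TemplateTransformer"
          preImportDeleteQuery="type:entity2">
    <field column="type" template="entity2" name="type" />
    <field column="column1" name="column1" />
    <field column="column2" name="column2" />
  </entity>
</document>
```

At query time, `fq=type:entity1` restricts results to one source, while omitting the filter searches both at once.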