Solr: How to distinguish between multiple entities imported through DIH

When using DataImportHandler with SqlEntityProcessor, I want several entity definitions, each with a different query, to feed the same schema.
How can I search both types of entities and still tell which source each document came from? Example:
<document>
<entity name="entity1" query="query1">
<field column="column1" name="column1" />
<field column="column2" name="column2" />
</entity>
<entity name="entity2" query="query2">
<field column="column1" name="column1" />
<field column="column2" name="column2" />
</entity>
</document>
How can I get data from entity 1 and from entity 2?

As long as your schema fields (e.g. column1, column2) are compatible between the different entities, you can just run DataImportHandler and it will populate the Solr collection from both queries.
Then, when you query, you will see the documents from all entities combined.
If you want to mark which document came from which source, I would recommend adding another field (e.g. type) and assigning it a different static value in each entity definition using TemplateTransformer.
Also beware of the clean command. By default it deletes everything from the index. As you are populating the index from several sources, you need to make sure it does not delete too much. Use preImportDeleteQuery to delete only the entries whose type field has the value you set for that entity.
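A minimal sketch of the setup described above, assuming a `type` field has been added to the schema. The field name `type` and the values `entity1` and `entity2` are illustrative and not taken from the question:

```xml
<document>
  <!-- TemplateTransformer writes a constant into the "type" column of every row.
       preImportDeleteQuery limits the clean step to this entity's documents. -->
  <entity name="entity1" query="query1"
          transformer="TemplateTransformer"
          preImportDeleteQuery="type:entity1">
    <field column="column1" name="column1" />
    <field column="column2" name="column2" />
    <field column="type" template="entity1" />
  </entity>
  <entity name="entity2" query="query2"
          transformer="TemplateTransformer"
          preImportDeleteQuery="type:entity2">
    <field column="column1" name="column1" />
    <field column="column2" name="column2" />
    <field column="type" template="entity2" />
  </entity>
</document>
```

With something like this in place, a filter query such as fq=type:entity1 should restrict a search to one source, and a single entity can be re-imported on its own with command=full-import&entity=entity1.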

Related

Dynamic TableName SOLR data import handler

I'm looking to configure SOLR to query a table based on certain data.
I unfortunately have to work with how the Database is setup, but here's what I'm after.
I have a table named Company that will contain a certain "prefix" value.
I want to use that prefix value to determine what tables I should query for the DIH.
As a quick sample:
<entity name="company" query="Select top 1 prefix from Company">
<field name="prefix" column="prefix"/>
<entity name="item" query="select * from ${company.prefix}item">
<field column="ItemID" name="id"/>
<field column="Description" name="description"/>
</entity>
</entity>
However, only one document ever seems to get processed, even though that table contains over 200,000 rows.
What am I doing wrong?

I think you could achieve this by using a stored procedure:
You can call a stored procedure from DIH, as seen here.
Inside the stored procedure, do the table lookup as needed, and then return the results of the real query.
Depending on how comfortable you are with SQL Server's SQL dialect, you might be able to put everything into a single SQL query and use that directly in DIH, but I'm not sure about that.
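A different angle, not part of the answer above and offered only as a sketch: in DIH, only rows of the root entity become Solr documents, and rows of a nested entity are folded into the parent's document, which would explain getting a single document from a `top 1` outer query. Setting `rootEntity="false"` on the outer entity makes the entity directly beneath it the document root. Table and column names come from the question. Whether this fits the real schema is an assumption.

```xml
<document>
  <!-- rootEntity="false": rows of "company" no longer create documents;
       each row of the nested "item" entity becomes its own document. -->
  <entity name="company" rootEntity="false" query="select top 1 prefix from Company">
    <entity name="item" query="select * from ${company.prefix}item">
      <field column="ItemID" name="id"/>
      <field column="Description" name="description"/>
    </entity>
  </entity>
</document>
```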

SolR 5 : how can I index multiple databases in one core

I'm currently trying to index multiple databases (two MySQL and one PostgreSQL) into a single index so that a website can search across all of them.
I've succeeded in importing each MySQL database separately into its own core (and so into different indexes).
Edit: The problem is that each table has its own id column, and these ids conflict with each other. How can I tell Solr that each database has its own set of IDs?
Code:
<entity name="database1" dataSource="ds-database1" query="SELECT id, my_column FROM table_database1">
<field column="id" name="id" />
<field column="my_column" name="ts_my_column" />
</entity>
<entity name="database2" dataSource="ds-database2" query="
SELECT id, column_example
FROM table_database2" >
<field column="id" name="id" />
<field column="column_example" name="ts_columnsexample" />
</entity>

You can use a TemplateTransformer to add content in front of a value when using DIH, or you can do it in your SQL:
SELECT CONCAT('db_1_', id) AS id ...
.. or you can do it with a ScriptTransformer if you need even more logic around the transformation.
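A sketch of the TemplateTransformer variant, using the first entity from the question. The `db_1_` prefix mirrors the SQL example above. It assumes the schema's id field is a string type:

```xml
<entity name="database1" dataSource="ds-database1"
        transformer="TemplateTransformer"
        query="SELECT id, my_column FROM table_database1">
  <!-- Overwrite the id column with the row's own id, prefixed per database -->
  <field column="id" template="db_1_${database1.id}" />
  <field column="my_column" name="ts_my_column" />
</entity>
```

The second entity would use its own prefix (for example db_2_) and reference ${database2.id}, so ids from the two databases no longer collide.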

Solr: split category data and product data over different cores/instances?

I have a webshop with multiple different productcategories.
For each category I have a description, metadata, image and some more category specific data.
Right now, my data-config.xml looks as below.
However, I think that this way I'm indexing all the category-specific data for each product individually, which takes up a lot more space than needed.
I'm now considering moving the indexing and storing of category-specific data to a separate Solr core/instance, so that the product-specific data and the category data are kept apart.
Is this reasoning correct? Is it better to move the category-specific data out of this core/instance?
<document name="shopitems">
<entity name="shopitem" pk="id" query="select * from products" >
<field name="id" column="ID" />
<field name="articlenr" column="articlenr" />
<field name="title" column="title" />
<entity name="catdescription" query="select
pagetitle_de as cat_pagetitle_de,pagetitle_en as cat_pagetitle_en
,description as cat_description
,metadescription as cat_metadescription
FROM products_custom_cat_descriptions where articlegroup = '${shopitem.articlegroup}'">
</entity>
</entity>
</document>

Generally speaking, your implementation will be easier if you flatten (de-normalize) everything, as you did. If you spin off the categories in a different core, Solr becomes harder to use - you will need extra queries, extra client code, faceting won't work so easily, etc - all of which will result in a performance hit, on top of the extra implementation difficulties.
From the numbers you give (staying under 1GB index size? it's not that big), I would definitely not go the way of splitting out the category data, it will make your life harder, for not much practical gain.

solr sort on an unrelated entity field

My document structure is like this
<document>
<entity name="entity1" query="query1">
<field column="column1" name="column1" />
<!-- more columns specific to this entity -->
</entity>
<entity name="entity2" query="query2">
<field column="column2" name="column2" />
<!-- more columns specific to this entity -->
</entity>
</document>
My query involves entity1 columns only. If I add entity2 columns to the sort clause, why should the result be affected at all? The entity1 columns I query on are unrelated to entity2. Does Solr apply the sort clause to all documents first and only then apply the query conditions?
Documentation reads -
If sortMissingLast="false" and sortMissingFirst="false" (the default),
then default lucene sorting will be used which places docs without the
field first in an ascending sort and last in a descending sort.
Can someone please elaborate on the bolded text?

I think the last paragraph of my question had the answer in it.
If the sort field is missing from a document, the default Lucene sorting is used, which is why my results look "affected".
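For reference, the behaviour quoted from the documentation is controlled per field type in schema.xml. The sketch below is only an illustration: the type name `string_missing_last` is hypothetical, and treating column2 as a string field is an assumption.

```xml
<!-- Documents with no value in a field of this type sort after documents
     that have one, regardless of ascending or descending order -->
<fieldType name="string_missing_last" class="solr.StrField" sortMissingLast="true" />
<field name="column2" type="string_missing_last" indexed="true" stored="true" />
```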

How do I index rich-format documents contained as database BLOBs with Solr 4.0+?

I've found a few related solutions to this problem. The related solutions will not work for me as I'll explain. (I'm using Solr 4.0 and indexing data stored in an Oracle 11g database.)
Jonck van der Kogel's related solution (from 2009) is explained here. He describes creating a custom Transformer, sort of like the ClobTransformer that ships with Solr. This is going down the elegant path but is not using Tika which is now integrated with Solr. (He uses external PDFBox and FontBox.) This creates multiple maintenance / upgrade dependencies. Also, I need to be able to index Word documents in addition to PDF.
Since Kogel's solution seems to be on the right path, is there a way to use the Tika classes included with Solr in a custom Transformer? That would allow all the Tika functionality with Kogel's elegant database solution.
Another related solution is the ExtractingRequestHandler (ERH) that ships with Solr. However, as the name suggests, this is a request handler, such as to handle HTTP posts of rich-text documents. To extract documents from the database this way has performance and security problems. I would have to make the database BLOBs accessible via HTTP. I've found no discussion of using ERH for direct ingest from a database BLOB. Is it possible to directly ingest from database BLOBs with Solr Cell?
Another related solution is to write a Transformer (like Kogel's above) to convert a byte[] to a string (from DataImportHandler FAQ). With true binary documents this is going to feed junk into the index and not properly extract the text elements like Tika does. Won't work.
A final related solution is UpdateRichDocuments offered by the RichDocumentHandler. This is deprecated and no longer available in Solr. The page refers you to the ExtractingRequestHandler (discussed above).
It seems like the right solution is to use DataImportHandler and a custom Transformer using the Tika classes. How does this work?

Many hours later... First, there is a lot of misleading, wrong and useless information on this problem. No page seemed to provide everything in one place. All of the information is well intentioned but between differing versions and some going over my head, it didn't solve the problem. Here is my collection of what I learned and the solution. To reiterate, I'm using Solr 4.0 (on Tomcat) + Oracle 11g.
Solution overview: DataImportHandler + TikaEntityProcessor + FieldStreamDataSource
Step 1, make sure you update your solrconfig.xml so that solr can find the TikaEntityProcessor + DataImportHandler + Solr Cell stuff.
<lib dir="../contrib/dataimporthandler/lib" regex=".*\.jar" />
<!-- will include extras (where TikaEntPro is) and regular DIH -->
<lib dir="../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<lib dir="../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../dist/" regex="apache-solr-cell-\d.*\.jar" />
Step 2, modify your data-config.xml to include your BLOB table. This is where I had the most trouble, since the solutions to this problem have changed a lot as versions have changed. Plus, using multiple data sources and plugging them together correctly was not intuitive to me. Very sleek once it's done though. Make sure to replace the IP, SID name, username, password, table names, etc. with your own.
<dataConfig>
<dataSource name="dastream" type="FieldStreamDataSource" />
<dataSource name="db" type="JdbcDataSource"
driver="oracle.jdbc.OracleDriver"
url="jdbc:oracle:thin:@192.1.1.1:1521:sid"
user="username"
password="password"/>
<document>
<entity
name="attachments"
query="select * from schema.attachment_table"
dataSource="db">
<entity
name="attachment"
dataSource="dastream"
processor="TikaEntityProcessor"
url="blob_column"
dataField="attachments.BLOB_COLUMN"
format="text">
<field column="text" name="body" />
</entity>
</entity>
<entity name="unrelated" query="select * from another_table" dataSource="db">
</entity>
</document>
</dataConfig>
Important note here. If you're getting "No field available for name : whatever" errors when you attempt to import, the FieldStreamDataSource is not able to resolve the data field name you gave. For me, I had to set the url attribute to the lower-case column name and the dataField attribute to outside_entity_name.UPPERCASE_BLOB_COLUMN. A misspelled column name causes the same error, as I found out once.
Step 3, you need to modify your schema.xml to add the BLOB-column field (and any other column you need to index/store). Modify according to your needs.
<field name="body" type="text_en" indexed="false" stored="false" />
<field name="attach_desc" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_en" indexed="true" stored="false" multiValued="true" />
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true" />
<copyField source="body" dest="text" />
<copyField source="body" dest="content" />
With that you should be well on your way to saving many hours getting your binary, rich-text documents (aka rich documents) that are stored as BLOBs in a database column indexed with Solr.

The Integration of Tika and DIH is already provided with Solr via TikaEntityProcessor
Integration - SOLR-1358
Blob Handling - SOLR-1737
You need to just find the right combination.
