Solr dataimport change dataSource dynamically - solr

I have done the following settings for dataimport from about 20 mdb files using ucanaccess:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource name="a" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/main.mdb;memory=false" />
<dataSource name="a1" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/1.mdb;memory=false" />
<dataSource name="a2" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/2.mdb;memory=false" />
<dataSource name="a3" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/3.mdb;memory=false" />
<dataSource name="a4" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/4.mdb;memory=false" />
<!-- and so on -->
<document>
<entity name="Book" dataSource="a"
query="select bkid AS id, bkid AS BookID,bk AS BookTitle, betaka AS BookInfo, cat as cat from 0bok">
<field column="id" name="id"/>
<field column="BookID" name="BookID"/>
<field column="BookTitle" name="BookTitle"/>
<field column="cat" name="cat"/>
<entity name="Category" dataSource="a"
query="select name as CatName, catord as CatWeight, Lvl as CatLevel from 0cat where id = ${Book.CAT}">
<field column="CatName" name="CatName"/>
<field column="CatWeight" name="CatWeight"/>
<field column="CatLevel" name="CatLevel"/>
</entity>
<entity name="Pages" dataSource="a5" onError="continue"
query="SELECT nass AS PageContent, page AS pageNum FROM b${Book.ID} ORDER BY page">
<field column="PageContent" name="PageContent" />
<field column="PageNum" name="PageNum" />
<entity name="Titles" dataSource="a5" onError="continue"
query="SELECT * FROM t${Book.ID} WHERE id = ${Pages.PAGE} ORDER BY sub">
<field column="ID" name="TitleID"/>
<field column="TIT" name="PageTitle"/>
<field column="SUB" name="TitleWeight"/>
<field column="LVL" name="TitleLevel"/>
</entity>
</entity>
</entity>
</document>
</dataConfig>
In every time I liked to import from a different dataSource I had to change dataSource attribute manually for both Pages and Titles entities, then perform dataimport without clean. Now with more than 600 mdb files, it is not an wise option. Is there any way to make looping inside the config? In other words: there is a main entity or mdb files that handles all books titles and categories then every book has its own mdb file named with its id for example 245.mdb for the book of id 245, So I need to change the dataSource for Pages and Titles dynamically.

You cannot create dataSources in a loop, but I believe you can pass dataSource information in a parameter variable. So, perhaps, you can loop over your collection outside of Solr and then trigger DIH with the correct source as a parameter variable.
Just ensure to run DIH in sync mode to avoid different calls stepping on each other (I think the param is syncMode)

Related

Add dynamic field based of file path

I'm using DIH and Tika to index documents in different languages.
There's a folder for each language (e.g. /de/file001.pdf), and I want to extract the language from path and then dynamically add the language specific solr field (e.g. text_de).
Here's my attempted solution:
<dataConfig>
<script><![CDATA[
function addField(row) {
row.put('text_' + row.get('lang'), row.get('text'));
return row;
}
]]></script>
<dataSource type="BinFileDataSource" />
<document>
<entity name="files" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="/tmp/documents" fileName=".*\.(doc)|(pdf)|(docx)"
onError="skip"
recursive="true"
transformer="RegexTransformer" query="select * from files">
<field column="fileAbsolutePath" name="id" />
<field column="lang" regex=".*/(\w*)/.*" sourceColName="fileAbsolutePath"/>
<entity name="documentImport"
processor="TikaEntityProcessor"
url="${files.fileAbsolutePath}"
format="text"
transformer="script:addField">
<field column="date" name="date" meta="true"/>
<field column="title" name="title" meta="true"/>
</entity>
</entity>
</document>
This doesn't work because row contains the 'text' field but not the 'lang' field.
The approach is correct, however the problem is that you are using a row that has as scope only the current row.
In order to access to parent row, you have to use the context variable that you receive as second actual parameter to script function. The Context variable has the ContextImpl implementation and on each script invocation, Solr ScriptTransformer will send you as second parameter (see transformRow) the same Context instance.
The following script will allow you to extract field value from the parent row and should address your problem:
<dataConfig>
<script><![CDATA[
function addField(row, context) {
var lang = context.getParentContext().resolve('files.lang');
row.put('text_' + row.get('lang'), row.get('text'));
return row;
}
]]></script>

Extract file name (without extension) while indexing using Data Import Handler in Solr

Im successfully able to index pdf,doc,ppt,etc files using the Data Import Handler in solr 4.3.0 .
My data-config.xml looks like this -
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="C:\Users\aroraarc\Desktop\Impdo"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)|(pptx)|(xls)|(xlsx)|(txt)" onError="skip"
recursive="true">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastmodified" />
<field column="file" name="fileName"/>
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="content"/>
</entity>
</entity>
</document>
</dataConfig>
However in the fileName field i want to insert the pure file name without the extension. Eg - Instead of 'HelloWorld.txt' I want only 'HelloWorld' to be inserted in the fileName field. How do I achieve this?
Thanks in advance!
Check ScriptTransformer to replace or change the value before it is indexed.
Example -
Data Config - Add custom field -
<script><![CDATA[
function changeFileName(row){
var fileName= row.get('fileName');
// Replace or remove the extension .. e.g. from last index of .
file_name_new = file_name.replace ......
row.put(fileName, row.get('file_name_new'));
return row;
}
]]></script>
Entity mapping -
<entity name="f" transformer="script:changeFileName" ....>
......
</entity>

Struggling with learning solr

I am in the process of redesigning one of our companies site. My boss wants to play around with the idea of replacing all of our navigation with a search box.. the search box should be able to query any of our tables of unrelated data.
So right now I am trying it with 5 tables.
Products
Manufacturers
Category
Ingredients
Uses
So should be able to lookup a product name, a manufacturer name, a category name, an ingredient name, or a use name
When I retrieve the results. if the user clicked on a manufacturer search result.. It will take them to a manufacturer page that lookups all products for that manufacturer.
When clicks on a product page.. link will take them to that actual product information.
Ingredient will take them to a page that will show all products containing that ingredient.
Anyways here is my data config
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/xxx" user="xxx" password="xxx" />
<document>
<entity name="manufacturer" transformer="TemplateTransformer" pk="manNum"
query="SELECT manNum, manName FROM manufacturer
WHERE active = 1">
<field column="id" name="id" template="MAN-${manNum}" />
<field column="type" template="manufacturer" name="type"/>
<field column="manName" name="text"/>
<field column="manNum" name="manNum"/>
</entity>
<entity name="product" transformer="TemplateTransformer"
query="SELECT products.prodNum, products.prodName as text, m.manName FROM products JOIN man m USING (manNum)
WHERE products.active = 1
AND (hideWeb = 0 or hideWeb IS NULL)">
<field column="id" template="PROD-${products.prodNum}" name="id"/>
<field column="type" template="product" name="type"/>
<field column="text" name="text"/>
<field column="manName" name="manName"/>
</entity>
<entity name="ingredients" transformer="TemplateTransformer" pk="id"
query="SELECT id, text FROM inglist WHERE sort != ''">
<field column="id" name="id" template="ING-${inglist.id}"/>
<field column="type" template="ingredient" name="type"/>
<field column="text" name="text" />
</entity>
<entity name="uses" transformer="TemplateTransformer" pk="id"
query="SELECT id, text FROM useslist">
<field column="id" name="id" template="USE-${id}"/>
<field column="type" template="use" name="type"/>
<field column="text" name="text"/>
</entity>
<entity name="categories" transformer="TemplateTransformer" pk="id"
query="SELECT id, textShow as text FROM categorylist">
<field column="id" name="id" template="CATEGORY-${id}"/>
<field column="type" template="category" name="type"/>
<field column="text" name="text"/>
</entity>
</document>
</dataConfig>
And my schema..
<fields>
<field name="id" type="string" indexed="true" stored="true"/>
<field name="text" indexed="true" stored="true" type="text"/>
<field name="type" type="string" indexed="false" stored="true"/>
<field name="manName" type="text" indexed="false" stored="true"/>
<field name="manNum" type="string" indexed="false" stored="false"/>
</fields>
Now perhaps I am not doing this the right way... and there may be a better way to handle this.
Anyways the problem I am running into right now is that I am getting the error missing required field "id". Now products query and manufacturer query does not have an id column in the select.. but I thought the transform query should take care of it? If I do the select prodNum as id .. then all the ids are overwritting each other.
Now I could probably concat it in the actual query.. and will do so as a last resort, but would like to know what I am doing wrong with this solution.
EDIT
Nevermind, it was just a noob issue, for some reason I was thinking that the template variable was refering to the table name in the SQL not the entity name,
So I replaced all of the
With
And it worked.
Prefixing the table-specific ID with a distinct character or string is a good idea. I do it in the SQL, which allows me to check the behavior outside of Solr.
select
concat('b',cast(b.id as char)) as id,
...
It Was a noob issue,
for some reason I was thinking that the template variable was refering to the table name in the SQL not the entity name.
I do it like this:
<entity name="GG-Boryslaw-1939-Phonebook"
transformer="TemplateTransformer,DateFormatTransformer"
pk="id"
query="SELECT * FROM boryslaw_1939_phonebook">
<field column="record_id" template="GG-Boryslaw-1939-Phonebook-${GG-Boryslaw-1939-Phonebook.id}" />
<field column="record_type" template="phonebook" />
<field column="record_source" template="Boryslaw Phonebook (1939)" />
<field column="record_date" template="${GG-Boryslaw-1939-Phonebook.Year}" dateTimeFormat="yyyy" />
...etc...
</entity>

How to show filenames in search results using Solr's FileListEntityProcessor

I am trying to scan all pdf/doc files in a directory. This works fine and I am able to scan all documents.
The next thing i'm trying to do is also receiving the filename of the file in the search results. The filename however never shows up. I tried a couple of things, but the documentation is not very helpfull about how to do this.
I am using the solr configuration found in the solr distribution: apache-solr-3.1.0/example/example-DIH/solr/tika/conf
This is my dataConfig:
<dataConfig>
<dataSource type="BinFileDataSource" name="bin"/>
<document>
<entity name="f" processor="FileListEntityProcessor" recursive="true"
rootEntity="false" dataSource="null" baseDir="C:/solrtestsmall"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)" onError="skip">
<entity name="tika-test" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" dataSource="bin"
onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
<field column="fileName" name="fileName"/>
</entity>
</document>
</dataConfig>
I am interested in the way how to configure this correctly, and also the any other places I can find specific documentation.
You should use file instead of fileName in column
<field column="file" name="fileName"/>
Don't forget to add the 'fileName' to the schema.xml in the fields section.
<field name="fileName" type="string" indexed="true" stored="true" />

solr DIH - A problem about solr delta-imports

There is a problem when I use solr1.3 delta-imports to update the index. I have added the "last_modified" column in the table. After I use the "full-import" command to index the database data, the "dataimport.properties" file contains nothing, and when I use the "delta-import" command to update index, the solr list all the data in database not the lasted data. My db-data-config.xml:
deltaQuery="select shop_id from shop where last_modified > '${dataimporter.last_index_time}'">
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/funguide" user="root" password="root"/>
<document name="shopinfo">
<entity name="shop" pk="shop_id"
query="select shop_id,title,description,tel,address,longitude,latitude from shop"
<field column="shop_id" name="id" />
<field column="title" name="title" />
<field column="description" name="description" />
<field column="tel" name="tel" />
<field column="address" name="address" />
<field column="longitude" name="longitude" />
<field column="latitude" name="latitude" />
</entity>
</document>
</dataConfig>
Anyboby know how to solve the problem? Thanks!
enzhaohoo#gmail.com
I also would recommend upgrading to Solr 1.4 RC as there have been quite a few improvements made to delta-imports with DataImportHandler. Please see DataImportHandler - Using delta-import command - wikipage for specifics.

Resources