solr DIH - A problem about solr delta-imports - solr

There is a problem when I use solr1.3 delta-imports to update the index. I have added the "last_modified" column in the table. After I use the "full-import" command to index the database data, the "dataimport.properties" file contains nothing, and when I use the "delta-import" command to update index, the solr list all the data in database not the lasted data. My db-data-config.xml:
deltaQuery="select shop_id from shop where last_modified > '${dataimporter.last_index_time}'">
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/funguide" user="root" password="root"/>
<document name="shopinfo">
<entity name="shop" pk="shop_id"
query="select shop_id,title,description,tel,address,longitude,latitude from shop"
<field column="shop_id" name="id" />
<field column="title" name="title" />
<field column="description" name="description" />
<field column="tel" name="tel" />
<field column="address" name="address" />
<field column="longitude" name="longitude" />
<field column="latitude" name="latitude" />
</entity>
</document>
</dataConfig>
Anyboby know how to solve the problem? Thanks!
enzhaohoo#gmail.com

I also would recommend upgrading to Solr 1.4 RC as there have been quite a few improvements made to delta-imports with DataImportHandler. Please see DataImportHandler - Using delta-import command - wikipage for specifics.

Related

Indexing Text files using apache solr and tika

I have some doc files in d:/tmp/docs location on my local machine and I want to index them using Apache Solr and Tika. Following is my data-config.xml file.
<dataSource type="BinFileDataSource" />
<document>
<entity name="file_Import" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="D:/temp/docs" fileName=".*\.(doc)|(pdf)|(docx)"
onError="skip"
recursive="true">
<field column="fileAbsolutePath" name="id" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastModified" />
<entity
name="documentImport"
processor="TikaEntityProcessor"
url="${files.fileAbsolutePath}"
format="text">
<field column="file" name="fileName"/>
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
</entity>
</document>
When I try to import those files into solr I get following exception:
Caused by: java.net.MalformedURLException: no protocol: null
at java.net.URL.<init>(Unknown Source)
at java.net.URL.<init>(Unknown Source)
at java.net.URL.<init>(Unknown Source)
at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
... 11 more
I figured out that sorl is not able to locate d:/temp/docs folder.
Don't know how to resolve. Any help appreciated.
Resolved ...
I had more than one dataSource tags in my data-config.xml out of which one was <dataSource type="URLDataSource" />
causing a problem.. So I removed all the dataSources and kept only <dataSource type="BinFileDataSource" />
and it worked ... :)
check the url for datasource baseDir
try changing from
baseDir="D:/temp/docs"
to
baseDir="D:/temp/docs/"
and change filename like *.* to index all docs in that folder

Solr dataimport change dataSource dynamically

I have done the following settings for dataimport from about 20 mdb files using ucanaccess:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource name="a" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/main.mdb;memory=false" />
<dataSource name="a1" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/1.mdb;memory=false" />
<dataSource name="a2" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/2.mdb;memory=false" />
<dataSource name="a3" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/3.mdb;memory=false" />
<dataSource name="a4" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/4.mdb;memory=false" />
<!-- and so on -->
<document>
<entity name="Book" dataSource="a"
query="select bkid AS id, bkid AS BookID,bk AS BookTitle, betaka AS BookInfo, cat as cat from 0bok">
<field column="id" name="id"/>
<field column="BookID" name="BookID"/>
<field column="BookTitle" name="BookTitle"/>
<field column="cat" name="cat"/>
<entity name="Category" dataSource="a"
query="select name as CatName, catord as CatWeight, Lvl as CatLevel from 0cat where id = ${Book.CAT}">
<field column="CatName" name="CatName"/>
<field column="CatWeight" name="CatWeight"/>
<field column="CatLevel" name="CatLevel"/>
</entity>
<entity name="Pages" dataSource="a5" onError="continue"
query="SELECT nass AS PageContent, page AS pageNum FROM b${Book.ID} ORDER BY page">
<field column="PageContent" name="PageContent" />
<field column="PageNum" name="PageNum" />
<entity name="Titles" dataSource="a5" onError="continue"
query="SELECT * FROM t${Book.ID} WHERE id = ${Pages.PAGE} ORDER BY sub">
<field column="ID" name="TitleID"/>
<field column="TIT" name="PageTitle"/>
<field column="SUB" name="TitleWeight"/>
<field column="LVL" name="TitleLevel"/>
</entity>
</entity>
</entity>
</document>
</dataConfig>
In every time I liked to import from a different dataSource I had to change dataSource attribute manually for both Pages and Titles entities, then perform dataimport without clean. Now with more than 600 mdb files, it is not an wise option. Is there any way to make looping inside the config? In other words: there is a main entity or mdb files that handles all books titles and categories then every book has its own mdb file named with its id for example 245.mdb for the book of id 245, So I need to change the dataSource for Pages and Titles dynamically.
You cannot create dataSources in a loop, but I believe you can pass dataSource information in a parameter variable. So, perhaps, you can loop over your collection outside of Solr and then trigger DIH with the correct source as a parameter variable.
Just ensure to run DIH in sync mode to avoid different calls stepping on each other (I think the param is syncMode)

Extract file name (without extension) while indexing using Data Import Handler in Solr

Im successfully able to index pdf,doc,ppt,etc files using the Data Import Handler in solr 4.3.0 .
My data-config.xml looks like this -
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="C:\Users\aroraarc\Desktop\Impdo"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)|(pptx)|(xls)|(xlsx)|(txt)" onError="skip"
recursive="true">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastmodified" />
<field column="file" name="fileName"/>
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="content"/>
</entity>
</entity>
</document>
</dataConfig>
However in the fileName field i want to insert the pure file name without the extension. Eg - Instead of 'HelloWorld.txt' I want only 'HelloWorld' to be inserted in the fileName field. How do I achieve this?
Thanks in advance!
Check ScriptTransformer to replace or change the value before it is indexed.
Example -
Data Config - Add custom field -
<script><![CDATA[
function changeFileName(row){
var fileName= row.get('fileName');
// Replace or remove the extension .. e.g. from last index of .
file_name_new = file_name.replace ......
row.put(fileName, row.get('file_name_new'));
return row;
}
]]></script>
Entity mapping -
<entity name="f" transformer="script:changeFileName" ....>
......
</entity>

indexing mutilple tables in solr using DIH

I'm developing search engine using Solr and I've been successful in indexing data from one table using DIH (Dataimport Handler). What I need is to get search result from 5 different tables. I couldn't do this without help.
if we assume x table with x rows, there should be x * x documents from each table, which lead to, 5x documents if I have 5 tables as total. In dataconfig.xml, I created 5 seperate entities in single document as shown below. the result from indexed data when I query *:* is only 6 of the entity users and 3 from entity classes which is the number of users total rows which is 9.
Clearly, this way didn't work for me, so how can I achieve this using only single core?
note: I followed DIHQuickStart and DIH tutorial which didn't help me.
<document>
<!-- Users -->
<entity dataSource="JdbcDataSource" name=" >
<field column="name" name="name" sourceColName="name" />
<field column="username" name="username" sourceColName="username"/>
<field column="email" name="email" sourceColName="email" />
<field column="country" name="country" sourceColName="country" />
</entity>
<!-- Classes -->
<entity dataSource="JdbcDataSource" name="classes" >
<field column="code" name="code" sourceColName="code" />
<field column="title" name="title" sourceColName="title" />
<field column="description" name="description" sourceColName="description" />
</entity>
<!-- Schools -->
<entity dataSource="JdbcDataSource" name="schools" >
<field column="school_name" name="school_name" sourceColName="school_name" />
<field column="country" name="country" sourceColName="country" />
<field column="city" name="city" sourceColName="city" />
</entity>
<!-- Resources -->
<entity dataSource="JdbcDataSource" name="resources" >
<field column="title" name="title" sourceColName="title" />
<field column="description" name="description" sourceColName="description" />
</entity>
<!-- Tasks -->
<entity dataSource="JdbcDataSource" name="tasks" >
<field column="title" name="title" sourceColName="title" />
<field column="description" name="description" sourceColName="description" />
</entity>
</document>
you need to look at the structures of your tables then either create queries with joins or creat nested entities like this for example
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document name="schools">
<entity name="school" query="select * from schools s ">
<field column="school_name" name="school_Name" />
<entity name="school_class" query="select * from schools_classes sc where sc.school_name = '${school.school_name}'">
<field column="class_code" name="class_code" />
<entity name="class" query="select class_name from classes c where c.class_name= '${school_class.class_code}'">
<field column="class_name" name="class_name" />
</entity>
</entity>
</entity>
</document>
</dataConfig>

How to show filenames in search results using Solr's FileListEntityProcessor

I am trying to scan all pdf/doc files in a directory. This works fine and I am able to scan all documents.
The next thing i'm trying to do is also receiving the filename of the file in the search results. The filename however never shows up. I tried a couple of things, but the documentation is not very helpfull about how to do this.
I am using the solr configuration found in the solr distribution: apache-solr-3.1.0/example/example-DIH/solr/tika/conf
This is my dataConfig:
<dataConfig>
<dataSource type="BinFileDataSource" name="bin"/>
<document>
<entity name="f" processor="FileListEntityProcessor" recursive="true"
rootEntity="false" dataSource="null" baseDir="C:/solrtestsmall"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)" onError="skip">
<entity name="tika-test" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" dataSource="bin"
onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
<field column="fileName" name="fileName"/>
</entity>
</document>
</dataConfig>
I am interested in the way how to configure this correctly, and also the any other places I can find specific documentation.
You should use file instead of fileName in column
<field column="file" name="fileName"/>
Don't forget to add the 'fileName' to the schema.xml in the fields section.
<field name="fileName" type="string" indexed="true" stored="true" />

Resources