Indexing an XML document in Solr

My XML file is not getting indexed in Solr. The Org element in the XML file has an id attribute. My data-config.xml is:
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="OrgDBdir" rootEntity="false" dataSource="null"
processor="FileListEntityProcessor"
fileName="orgdb.xml" recursive="true"
baseDir="/var/www/html/intranet"
>
<entity name="OrgDB"
processor="XPathEntityProcessor"
stream="true"
useSolrAddSchema="true"
pk="id"
datasource="OrgDBdir"
forEach="/Orgs/Org"
url="${OrgDBdir.fileAbsolutePath}"
transformer="DateFormatTransformer, RegexTransformer">
<field column="id" xpath="/Orgs/Org/[#id]"/>
<filed column="orgname" xpath="/Orgs/Org/OrgName"/>
<filed column="officialname" xpath="/Orgs/Org/OfficialName"/>
</entity>
</entity>
</document>
</dataConfig>
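Judging from the config alone, a few things look likely to stop the import, though this is a guess and not a confirmed diagnosis. The two `filed` elements are misspelled, so they are not read as field mappings. An attribute is normally selected with `/Orgs/Org/@id`. `useSolrAddSchema="true"` tells XPathEntityProcessor to expect Solr's own `<add><doc>` format instead of custom XML. And `datasource="OrgDBdir"` names an entity, not a data source. A minimal sketch of a corrected config, assuming the data source is given the hypothetical name `orgFiles`:

```xml
<dataConfig>
  <!-- "orgFiles" is a hypothetical name, added so the inner entity can refer to it -->
  <dataSource name="orgFiles" type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="OrgDBdir" rootEntity="false" dataSource="null"
            processor="FileListEntityProcessor"
            fileName="orgdb.xml" recursive="true"
            baseDir="/var/www/html/intranet">
      <!-- useSolrAddSchema is dropped: the file is custom XML, not Solr add XML -->
      <!-- the transformer attribute is dropped: no field here uses a transformer -->
      <entity name="OrgDB"
              processor="XPathEntityProcessor"
              stream="true"
              pk="id"
              dataSource="orgFiles"
              forEach="/Orgs/Org"
              url="${OrgDBdir.fileAbsolutePath}">
        <!-- @id selects the id attribute of each Org element -->
        <field column="id" xpath="/Orgs/Org/@id"/>
        <field column="orgname" xpath="/Orgs/Org/OrgName"/>
        <field column="officialname" xpath="/Orgs/Org/OfficialName"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The schema would also need matching `id`, `orgname` and `officialname` fields for these values to be stored.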

Related

Solr DataImport set field to specific value [duplicate]

I'm building an index in Solr from a database in the following way:
<document name="Index">
<entity name="c" query="SELECT * FROM C">
<field column="Name" name="name"/>
</entity>
<entity name="p" query="SELECT * FROM P">
<field column="Name" name="name"/>
</entity>
</document>
Is it possible to have a static field, set for each row, that signifies which type is returned to the client, so that one can query the right database table based on that information in the JSON result?
That is, a field that has no column in the table:
<field name="id" value="1"/>
Or is there another way to solve this?
One option is to use the TemplateTransformer:
<document name="Index">
<entity name="c" transformer="TemplateTransformer" query="SELECT * FROM C">
<field column="Name" name="name"/>
<field column="id" template="1"/>
</entity>
<entity name="p" transformer="TemplateTransformer" query="SELECT * FROM P">
<field column="Name" name="name"/>
<field column="id" template="1"/>
</entity>
</document>
You can add a column to your SQL query that contains static data like this:
<document name="Index">
<entity name="c" query="SELECT *, 'foo' as NameFromC FROM C">
<field column="NameFromC" name="name"/>
</entity>
<entity name="p" query="SELECT *, 'bar' as NameFromP FROM P">
<field column="NameFromP" name="name"/>
</entity>
</document>
If you try to add a field with only name and template attributes, Solr will throw an error saying Field must have a column attribute.
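Since the goal in the question is to tell the client which table a row came from, the template value can differ per entity. A minimal sketch, assuming a hypothetical `doctype` field has been added to the schema:

```xml
<document name="Index">
  <!-- "doctype" is a hypothetical field name; it must be defined in schema.xml -->
  <entity name="c" transformer="TemplateTransformer" query="SELECT * FROM C">
    <field column="Name" name="name"/>
    <!-- every row from table C is tagged with the literal value "c" -->
    <field column="doctype" template="c"/>
  </entity>
  <entity name="p" transformer="TemplateTransformer" query="SELECT * FROM P">
    <field column="Name" name="name"/>
    <!-- every row from table P is tagged with the literal value "p" -->
    <field column="doctype" template="p"/>
  </entity>
</document>
```

Each field keeps a `column` attribute, which avoids the error mentioned above.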

Solr dataimport change dataSource dynamically

I have set up the following configuration to import data from about 20 mdb files using UCanAccess:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource name="a" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/main.mdb;memory=false" />
<dataSource name="a1" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/1.mdb;memory=false" />
<dataSource name="a2" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/2.mdb;memory=false" />
<dataSource name="a3" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/3.mdb;memory=false" />
<dataSource name="a4" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource" url="jdbc:ucanaccess://E:/feqh/A/4.mdb;memory=false" />
<!-- and so on -->
<document>
<entity name="Book" dataSource="a"
query="select bkid AS id, bkid AS BookID,bk AS BookTitle, betaka AS BookInfo, cat as cat from 0bok">
<field column="id" name="id"/>
<field column="BookID" name="BookID"/>
<field column="BookTitle" name="BookTitle"/>
<field column="cat" name="cat"/>
<entity name="Category" dataSource="a"
query="select name as CatName, catord as CatWeight, Lvl as CatLevel from 0cat where id = ${Book.CAT}">
<field column="CatName" name="CatName"/>
<field column="CatWeight" name="CatWeight"/>
<field column="CatLevel" name="CatLevel"/>
</entity>
<entity name="Pages" dataSource="a5" onError="continue"
query="SELECT nass AS PageContent, page AS pageNum FROM b${Book.ID} ORDER BY page">
<field column="PageContent" name="PageContent" />
<field column="PageNum" name="PageNum" />
<entity name="Titles" dataSource="a5" onError="continue"
query="SELECT * FROM t${Book.ID} WHERE id = ${Pages.PAGE} ORDER BY sub">
<field column="ID" name="TitleID"/>
<field column="TIT" name="PageTitle"/>
<field column="SUB" name="TitleWeight"/>
<field column="LVL" name="TitleLevel"/>
</entity>
</entity>
</entity>
</document>
</dataConfig>
Every time I wanted to import from a different dataSource, I had to change the dataSource attribute manually for both the Pages and Titles entities, then run the dataimport without clean. Now, with more than 600 mdb files, that is not a wise option. Is there any way to loop inside the config? In other words: a main mdb file holds all book titles and categories, and every book has its own mdb file named after its id, for example 245.mdb for the book with id 245. So I need to change the dataSource for Pages and Titles dynamically.
You cannot create dataSources in a loop, but I believe you can pass dataSource information in a parameter variable. So, perhaps, you can loop over your collection outside of Solr and then trigger DIH with the correct source as a parameter variable.
Just make sure to run DIH in synchronous mode so that successive calls do not step on each other (the request parameter is synchronous=true).
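One way to sketch the parameter approach: DIH exposes request parameters to the config as `${dataimporter.request.<name>}`, so the per-book data source could take the file name from the request. The parameter name `bookdb` and the data source name `book` are hypothetical, and whether the placeholder is resolved inside a `dataSource` URL should be verified against the Solr version in use.

```xml
<dataConfig>
  <!-- main database with the book list and categories, unchanged from the question -->
  <dataSource name="a" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource"
              url="jdbc:ucanaccess://E:/feqh/main.mdb;memory=false" />
  <!-- per-book database: "bookdb" is a hypothetical request parameter, e.g. 245 -->
  <dataSource name="book" driver="net.ucanaccess.jdbc.UcanaccessDriver" type="JdbcDataSource"
              url="jdbc:ucanaccess://E:/feqh/A/${dataimporter.request.bookdb}.mdb;memory=false" />
  <!-- the document section stays as in the question,
       except that Pages and Titles would use dataSource="book" -->
</dataConfig>
```

An external script would then call the handler once per file, for example with `command=full-import&clean=false&synchronous=true&bookdb=245`, waiting for each call to return before issuing the next.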

Solr - DataImportHandler Not Working

I have a simple DataImportHandler that works on my local system but not on my server. Both run the same Solr version, 4.6.0.
I've tried these configurations for DataImportHandler:
Configuration 1:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="org.postgresql.Driver"
url="jdbc:postgresql://HOST:5432/mydb"
user="admin"
password="admin" />
<script><![CDATA[
function generate_resource_uri(row) {
row.put('resource_uri', '/api/v1/product/' + row.get('id') + '/');
return row;
}
]]></script>
<document>
<entity name="products_product"
query="SELECT id, image_url, impression_url, product_url, manufacturer_name, discount_percentage, short_description, merchant_name, product_name, sku, long_description, date_modified, merchant_id, commission, keywords, product_id, retail_price, date_created FROM products_product"
transformer="script:generate_resource_uri" >
<entity name="source" query="select title from products_source where id = '${products_product.id}'"
processor="CachedSqlEntityProcessor">
<field column="title" name="source"/>
</entity>
<entity name="currency" query="select code from products_currency where id = '${products_product.id}'"
processor="CachedSqlEntityProcessor">
<field column="code" name="currency"/>
</entity>
<entity name="category" query="select title from products_category where id = '${products_product.id}'"
processor="CachedSqlEntityProcessor">
<field column="title" name="category"/>
</entity>
</entity>
</document>
</dataConfig>
Configuration 2:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="org.postgresql.Driver"
url="jdbc:postgresql://HOST:5432/mydb"
user="admin"
password="123456"/>
<script><![CDATA[
function generate_resource_uri(row) {
row.put('resource_uri', '/api/v1/product/' + row.get('id') + '/');
return row;
}
]]></script>
<document>
<entity name="products_product"
query="SELECT id, image_url, impression_url, product_url, manufacturer_name, discount_percentage, short_description, merchant_name, product_name, sku, long_description, date_modified, merchant_id, commission, keywords, product_id, retail_price, date_created FROM products_product"
transformer="script:generate_resource_uri" >
<entity name="source" query="select title from products_source where id = '${products_product.id}'"
cachePk="id" cacheLookup="products_product.id" cacheImpl="SortedMapBackedCache">
<field column="title" name="source"/>
</entity>
<entity name="currency" query="select code from products_currency where id = '${products_product.id}'"
cachePk="id" cacheLookup="products_product.id" cacheImpl="SortedMapBackedCache">
<field column="code" name="currency"/>
</entity>
<entity name="category" query="select title from products_category where id = '${products_product.id}'"
cachePk="id" cacheLookup="products_product.id" cacheImpl="SortedMapBackedCache">
<field column="title" name="category"/>
</entity>
</entity>
</document>
</dataConfig>
Locally I have approximately 2K rows, which are indexed properly, and all the child entities show up.
On the server, the fields from the child entities (source, category and currency) do not show up. The server has approximately 6M rows. It may be a silly doubt, but I hope memory is not the issue. The server runs on an m1.medium EC2 instance with Ubuntu 12.04 LTS.
Thanks in Advance :)

Extract file name (without extension) while indexing using Data Import Handler in Solr

I'm able to index pdf, doc, ppt, etc. files using the Data Import Handler in Solr 4.3.0.
My data-config.xml looks like this:
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="C:\Users\aroraarc\Desktop\Impdo"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)|(pptx)|(xls)|(xlsx)|(txt)" onError="skip"
recursive="true">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastmodified" />
<field column="file" name="fileName"/>
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="content"/>
</entity>
</entity>
</document>
</dataConfig>
However, in the fileName field I want to store the bare file name without the extension. For example, instead of 'HelloWorld.txt' I want only 'HelloWorld' to be stored in the fileName field. How do I achieve this?
Thanks in advance!
Use a ScriptTransformer to replace or change the value before it is indexed.
For example, add a script function to the data config:
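<script><![CDATA[
function changeFileName(row) {
  // FileListEntityProcessor puts the name in the 'file' column,
  // which the question's config maps to the fileName field.
  var fileName = '' + row.get('file');
  // Drop everything from the last '.' onward, if there is one.
  var dot = fileName.lastIndexOf('.');
  if (dot > 0) {
    fileName = fileName.substring(0, dot);
  }
  row.put('file', fileName);
  return row;
}
]]></script>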
Then reference the function from the entity:
<entity name="f" transformer="script:changeFileName" ....>
......
</entity>

Resources