Problem with uniqueKey in Solr

I am new to Solr. While creating the index, I prepend a string to the database table id.
My field in schema.xml is as follows:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>
I am passing 'GROUP1' as the id, but it is stored as something like [B@1e090ee.
How can I store the actual value (GROUP1) instead of [B@1e090ee?
Please help.

Is group_id a string or some numeric data type?
If it's not a string, you need to cast it to char before concatenation, with the appropriate encoding.
Also add an encoding parameter (one that matches your MySQL database encoding) to the dataSource tag, like this:
<dataSource
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://host/dbname"
batchSize="-1"
user="username"
password="password"
readOnly="true"
autoCommit="true"
encoding="UTF-8" />
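As a rough illustration of the cast suggested above, the id expression in the entity query could be written as follows. This is a minimal sketch that assumes group_id is a numeric column; the table and column names follow the asker's query, and the column list is shortened for readability. On some MySQL versions, CONCAT with a numeric argument returns a binary string, which would explain a byte-array style value ending up in Solr.

```xml
<!-- Sketch only: the CAST is assumed to make CONCAT return a character string
     instead of a binary one -->
<entity name="item"
        query="SELECT group_id,
                      CONCAT('GROUP', CAST(group_id AS CHAR)) AS id
               FROM collaboration_groups
               WHERE group_status = 1">
  <field column="id" name="id" />
  <field column="group_id" name="itemid" />
</entity>
```

If the id then shows up as GROUP1 after a full import, the binary result of CONCAT was the likely cause.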

Which DB are you using?
Do you see the correct values when you execute your query directly in your db?
IMHO, the problem has to be either with DataImportHandler, or you actually have values like that ([B@1e090ee) in your group_id field.
Have you checked that the encoding param in your dataConfig's dataSource is the same as your database's encoding?
Can you post your dataConfig file?

@mbonaci
I am using a MySQL database.
When I execute the same query directly, the results come back fine.
The following is my data config file:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://host/dbname" batchSize="-1" user="username" password="password" readOnly="true" autoCommit="true" />
<document name="products">
<entity name="item" query="select group_id,group_title,description,DATE_FORMAT(created_date, '%Y-%m-%dT%H:%i:%sZ') as createdDate,group_status,CONCAT('GROUP',group_id) as id,'GROUP' as itemtype from collaboration_groups where group_status=1 ">
<field column="id" name="id" />
<field column="group_id" name="itemid" />
<field column="itemtype" name="itemtype" />
<field column="group_title" name="fullName" />
<field column="description" name="description"/>
<field column="createdDate" name="createdDate"/>
</entity>
</document>
</dataConfig>

Related

DataImportHandlerException: Unable to execute query: select <column> from <table_name> Processing Document

I am trying to import data from a relational database into Solr for indexing.
Here is my data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="org.postgresql.Driver"
url="jdbc:postgresql://link/db"
user="user"
password="pass"/>
<document>
<entity name="locations_countries" query="select name from locations_countries">
</entity>
</document>
</dataConfig>
Managed-schema.xml
<uniqueKey>name</uniqueKey>
<field name="name" type="string" indexed="true" stored="true"/>
I created the core and then tried to run the import from localhost, but no columns are fetched, and when I checked the log I found the following error:
I have tried everything, but nothing works. I also deleted and recreated the core, but I get the same error again.

More than one transformer in a single entity using DataImportHandler

I am using DataImportHandler for indexing data in Solr.
I am retrieving data from three columns of the AUTO table in the database. Two of them, TOPIC and PARTS, hold data of type 'CLOB', and the DATE column holds an Oracle timestamp with the created date.
The problem is in my data-config file, where I need to convert the CLOB data to strings and also convert the date to the UTC format that Solr uses.
So I need two transformers, i.e. ClobTransformer and DateFormatTransformer.
I am wondering how to use both transformers in a single entity.
Here is my data-config file:
<dataConfig>
<dataSource name="ds1" type="JdbcDataSource"
driver="oracle.jdbc.OracleDriver"
url="....."
user="....."
password="...."/>
<document name="doc">
<entity name="ent"
query="Select
auto.ID,
auto.Topic as Topic,
auto.Parts as Parts,
to_date(to_char(auto.Date, 'yyyy-MM-dd HH:MI:SS'), 'YYYY-MM-DD HH:MI:SS') AS Date
From auto
order by auto.Date DESC"
dataSource="ds1" transformer="DateFormatTransformer">
<field column="ID" name="id"/>
<field column="TOPIC" name="topic" clob="true"/>
<field column="PARTS" name="parts" clob="true"/>
<field column="DATE" name="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd HH:mm:ss" locale="en"/>
</entity>
</document>
</dataConfig>
Above I have used only DateFormatTransformer.
Any help would be much appreciated.
OK, I found out how it's done. Just list the transformers, separated by commas, in the 'transformer' attribute of the entity tag, like this:
<dataConfig>
<dataSource name="ds1" type="JdbcDataSource"
driver="oracle.jdbc.OracleDriver"
url="....."
user="....."
password="...."/>
<document name="doc">
<entity name="ent"
query="Select
auto.ID,
auto.Topic as Topic,
auto.Parts as Parts,
to_date(to_char(auto.Date, 'yyyy-MM-dd HH:MI:SS'), 'YYYY-MM-DD HH:MI:SS') AS Date
From auto
order by auto.Date DESC"
dataSource="ds1" transformer="ClobTransformer,DateFormatTransformer">
<field column="ID" name="id"/>
<field column="TOPIC" name="topic" clob="true"/>
<field column="PARTS" name="parts" clob="true"/>
<field column="DATE" name="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd HH:mm:ss" locale="en"/>
</entity>
</document>
</dataConfig>
I have used two transformers, transformer="ClobTransformer,DateFormatTransformer"

Only index documents that contain a specific string in Solr

How can I index only the documents that contain a specific string in Solr? This is my current DataImportHandler config:
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="page"
processor="XPathEntityProcessor"
stream="true"
forEach="/mediawiki/page/"
url="pages.xml"
transformer="RegexTransformer"
>
<field column="id" xpath="/mediawiki/page/id" />
<field column="title" xpath="/mediawiki/page/title" />
<field column="text" regex="\{\{PersonData" xpath="/mediawiki/page/revision/text" />
</entity>
</document>
</dataConfig>
I only want to index a document if its text field contains {{PersonData, but the config above imports everything. Should this be specified in the import handler or in the schema?
You need to do something like this:
<field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
In this case, documents matching the specified regex are skipped, i.e. articles that are "redirects" to other articles are skipped here.
Detailed documentation is here:
http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
So in your case you need to find a way to say: skip all documents where "PersonData" is NOT in the "text" column.
Look specifically at the "Example: Indexing wikipedia" part of http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
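One way to express that inverse condition is a negative lookahead, so that the regex matches the whole text only when {{PersonData is absent. The following is an untested sketch based on the asker's config; the pattern is an assumption of mine, not something taken from the Solr wiki.

```xml
<entity name="page"
        processor="XPathEntityProcessor"
        stream="true"
        forEach="/mediawiki/page/"
        url="pages.xml"
        transformer="RegexTransformer">
  <field column="id" xpath="/mediawiki/page/id" />
  <field column="title" xpath="/mediawiki/page/title" />
  <field column="text" xpath="/mediawiki/page/revision/text" />
  <!-- Assumed pattern: (?s) lets . span newlines; the lookahead fails if the marker
       occurs anywhere, so only texts without the marker match in full and are
       replaced by "true" -->
  <field column="$skipDoc" regex="(?s)^(?!.*\{\{PersonData).*$" replaceWith="true" sourceColName="text" />
</entity>
```

When the text does contain {{PersonData, the regex does not match, $skipDoc ends up with a value other than "true", and the document should be indexed as usual.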

Tika fetches the binary content stored in the database but does not index it

I am trying to parse the binary content stored in the file_data column of the document_attachment table, and to index it so that its content becomes searchable in Solr.
When I run the indexer, it fetches twice as many rows as the query in the entity named "dcs" returns, and throws no errors or exceptions. However, it does not index the binary content (the field that I associate with Tika), despite fetching it from the table.
I am using apache-solr-3.6.1 and Tika 1.0.
My configuration files look something like this:
data-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
<dataSource
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/espritkm_1?zeroDateTimeBehavior=convertToNull"
user="root"
password=""
autoCommit="true" batchSize="-1"
convertType="false"
name="test"
/>
<dataSource name="fieldReader" type="FieldStreamDataSource" />
<document name="items">
<entity name="dcs"
query="SELECT 222000000000000000+d.id AS common_id_attr,d.id AS id,UNIX_TIMESTAMP(d.created_at) AS date_added,d.file_name as common1, d.description as common2, d.file_mime_type as common3, 72 as common4,(Select group_concat(trim(tags) ORDER BY trim(tags) SEPARATOR ' | ') from tags t where t.type_id = 72 AND t.feature_id = d.id group by t.feature_id) as common5,d.created_by as common6, df.name as common7,CONCAT(d.file_name,'.',d.file_mime_type) as common8,'' as common9,(Select da.file_data from document_attachment da where da.document_id = d.id) as text FROM document d LEFT JOIN document_folder df ON df.id = d.document_folder_id WHERE d.is_deleted = 0 and d.parent_id = 0 " dataSource="test" transformer="TemplateTransformer">
<field column="common_id_attr" name="common_id_attr" />
<field column="id" name="id" />
<entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="dcs.text" format="text" pk="dcs.id" >
<field column="text" name="text" />
</entity>
</entity>
</document>
</dataConfig>
schema.xml
<schema>
<fields>
<field name="common_id_attr" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="true" multiValued="true"/>
</fields>
<uniqueKey>common_id_attr</uniqueKey>
<solrQueryParser defaultOperator="OR"/>
<defaultSearchField>text</defaultSearchField>
</schema>
The number of rows it fetches is double the number of documents, apparently because the Tika rows are counted separately (I don't understand why). It does not index the binary content.
I have been stuck on this problem for a long time. Can someone please help?
I was able to index the documents using Apache Solr version 3.6.2. I have described the steps here:
http://tuxdna.wordpress.com/2013/02/04/indexing-the-documents-stored-in-a-database-using-apache-solr-and-apache-tika/
I think it should be doable in 3.6.1 as well. I was just too impatient to search for a tarball of version 3.6.1 when only 3.6.2 was available from the official site.
I hope that helps.

solr dataimport error: Indexing failed. Rolled back all changes

When I run the "Full import with cleaning" command, the error is "Indexing failed. Rolled back all changes".
My dataimport config file:
<dataConfig>
<dataSource type="JdbcDataSource" name="ds-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://my.ip/my_db" user="my_db_user" password="my_password" readOnly="True"/>
<document>
<entity name="videos" pk="ID" transformer="TemplateTransformer" dataSource="ds-1"
query="SELECT * FROM videos LIMIT 100">
<field column="id" name="unid" indexed="true" stored="true" />
<field column="title" name="baslik" indexed="true" stored="true" />
<field column="video_img" name="img" indexed="true" stored="true" />
</entity>
</document>
</dataConfig>
I kept receiving the same error message at one point. For me, the reasons were the following:
Bad connection string
Bad driver (com.mysql.jdbc.Driver)
Bad query
Bad mapping of columns to Solr fields (I think this might be your problem too)
Make sure the names of the columns in the database are the same (case sensitive) as the names of the fields in Solr. If they are not, rename the columns in the query:
select id as uniqueid, title as Tittle
or using the field element in the entity you defined like this:
<field column="ID" name="id" />
You are using the field element wrong. See here how you can use this element: http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml
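To illustrate the point about the field element: in data-config.xml a field only maps a result-set column to a Solr field name, while attributes such as indexed and stored are declared on the field in schema.xml. The following is a minimal sketch that reuses the column and field names from the question; the string type and the explicit column list in the query are assumptions, and the TemplateTransformer is left out because no template attribute is used.

```xml
<!-- data-config.xml: column-to-field mapping only -->
<entity name="videos" pk="id" dataSource="ds-1"
        query="SELECT id, title, video_img FROM videos LIMIT 100">
  <field column="id" name="unid" />
  <field column="title" name="baslik" />
  <field column="video_img" name="img" />
</entity>

<!-- schema.xml: indexing options live here (the string type is an assumption) -->
<field name="unid" type="string" indexed="true" stored="true" />
<field name="baslik" type="string" indexed="true" stored="true" />
<field name="img" type="string" indexed="true" stored="true" />
```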
If you can share other relevant data and logs we can give you more specific information.
Good luck.
