How to generate an Id using DataImportHandler? - solr

I'm new to Solr and I'm struggling to import some XML data that does not contain an id field, although my schema.xml requires one.
An XML example:
<results>
<estacions>
<estacio id="72400" nom="Aeroport"/>
<estacio id="79600" nom="Arenys de Mar"/>
...
</estacions>
</results>
Schema.xml:
<uniqueKey>id</uniqueKey>
I need to fetch this XML over HTTP, so I use the DataImportHandler.
This is my data-config.xml:
<dataConfig>
<dataSource type="URLDataSource" />
<document>
<entity name="renfe"
url="http://host_url/myexample.xml"
processor="XPathEntityProcessor"
forEach="/results/estacions/estacio"
transformer="script:generateCustomId">
<field column="idestacio" xpath="/results/estacions/estacio/@id" commonField="true" />
<field column="nomestacio" xpath="/results/estacions/estacio/@nom" commonField="true" />
</entity>
</document>
</dataConfig>
The import seems to run properly, but I get the following error:
org.apache.solr.common.SolrException: [doc=null] missing required field: id
This makes me think that I should generate an id automatically during the import, from data-config.xml, but I can't see how to do it.
How should I do this? With a ScriptTransformer? Any ideas would be appreciated.
Another question: can I force a value during the import?
For example: <field column="site" value="estacions"/> (obviously this does not work)

You can use a ScriptTransformer to generate the id. The function below keeps a counter and puts its value into each row as id:
<dataConfig>
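<script><![CDATA[
// Counter shared by all rows of one import run.
var id = 1;
// Called once per row by the ScriptTransformer; adds the generated id.
function GenerateId(row) {
row.put('id', (id++).toFixed());
return row;
}
]]></script>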
<dataSource type="URLDataSource" />
<document>
<entity name="renfe"
url="http://host_url/myexample.xml"
processor="XPathEntityProcessor"
forEach="/results/estacions/estacio"
transformer="script:GenerateId">
<field column="idestacio" xpath="/results/estacions/estacio/@id" commonField="true" />
<field column="nomestacio" xpath="/results/estacions/estacio/@nom" commonField="true" />
</entity>
</document>
</dataConfig>
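For the second question (forcing a value during the import), one option is DIH's TemplateTransformer, whose template attribute can hold a literal string. A minimal sketch, assuming a field named site is defined in schema.xml (the name is taken from the question; everything else mirrors the config above):

```xml
<dataConfig>
  <script><![CDATA[
    var id = 1;
    function GenerateId(row) {
      row.put('id', (id++).toFixed());
      return row;
    }
  ]]></script>
  <dataSource type="URLDataSource" />
  <document>
    <!-- Both transformers are listed, comma-separated, on the entity. -->
    <entity name="renfe"
            url="http://host_url/myexample.xml"
            processor="XPathEntityProcessor"
            forEach="/results/estacions/estacio"
            transformer="TemplateTransformer,script:GenerateId">
      <field column="idestacio" xpath="/results/estacions/estacio/@id" commonField="true" />
      <field column="nomestacio" xpath="/results/estacions/estacio/@nom" commonField="true" />
      <!-- Constant value: the template contains no ${...} variables.
           "site" is assumed to exist in schema.xml. -->
      <field column="site" template="estacions" />
    </entity>
  </document>
</dataConfig>
```

As far as I can tell the counter starts again at 1 on every import, so the generated ids are only stable if the source order is. If the id attribute of each estacio is unique, mapping it straight to id may be simpler.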

Related

DeltaImport fetches all the data

I'm indexing data from a database and using delta import to fetch only the recently updated data. However, I find that it fetches the whole data set twice and processes it once, even though the changes apply to only one row.
My config.xml, where the deltaQuery is defined:
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.github.cassandra.jdbc.CassandraDriver" url="jdbc:c*://127.0.0.1:9042/test" autoCommit="true" rowLimit = '-1' batchSize="-1"/>
<document name="content">
<entity name="test" query="SELECT * from person" deltaImportQuery="select * from person where seq=${dataimporter.delta.seq}" deltaQuery="select seq from person where last_modified > '${dataimporter.last_index_time}' ALLOW FILTERING" autoCommit="true">
<field column="seq" name="id" />
<field column="last" name="last_s" />
<field column="first" name="first_s" />
<field column="city" name="city_s" />
<field column="zip" name="zip_s" />
<field column="street" name="street_s" />
<field column="age" name="age_s" />
<field column="state" name="state_s" />
<field column="dollar" name="dollar_s" />
<field column="pick" name="pick_s" />
</entity>
</document>
</dataConfig>
There are about 2,100,000 rows, so the import always consumes a large amount of memory and ends up running out of memory. What could be the problem? Or is this just how it works?
If Solr is running out of memory, then it is time to add more memory to the Solr box. Adding more RAM will help alleviate the issue.

SOLR : get data from specific entity

I have multiple entities that share the same field name, name.
When I run http://localhost:8983/solr/test/select?q=name:Ch*&wt=json
I want data only from a particular entity, but right now I get data from all entities.
Is there any way to do this?
<dataConfig>
<dataSource name="fds" encoding="UTF-8" baseUrl="file://localhost/tmp/test/" type="URLDataSource" />
<document>
<entity name="tags"
processor="LineEntityProcessor"
dataSource="fds"
url="tags.csv"
rootEntity="true"
transformer="RegexTransformer" >
<field column="rawLine"
regex="^(.*),(.*),(.*)$"
groupNames="id,name," />
</entity>
<entity name="status"
processor="LineEntityProcessor"
dataSource="fds"
url="status.csv"
rootEntity="true"
transformer="RegexTransformer" >
<field column="rawLine"
regex="^(.*),(.*),(.*)$"
groupNames=",name," />
</entity>
</document>
</dataConfig>
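One possible approach, sketched under assumptions: tag every document with the entity it came from and filter on that tag at query time. The field name entity_type is hypothetical and would have to be defined in schema.xml; TemplateTransformer is used only to supply the constant value.

```xml
<dataConfig>
  <dataSource name="fds" encoding="UTF-8" baseUrl="file://localhost/tmp/test/" type="URLDataSource" />
  <document>
    <entity name="tags"
            processor="LineEntityProcessor"
            dataSource="fds"
            url="tags.csv"
            rootEntity="true"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine" regex="^(.*),(.*),(.*)$" groupNames="id,name," />
      <!-- Hypothetical marker field: records which entity produced the document. -->
      <field column="entity_type" template="tags" />
    </entity>
    <entity name="status"
            processor="LineEntityProcessor"
            dataSource="fds"
            url="status.csv"
            rootEntity="true"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine" regex="^(.*),(.*),(.*)$" groupNames=",name," />
      <field column="entity_type" template="status" />
    </entity>
  </document>
</dataConfig>
```

After a re-import, a filter query restricts the results to one entity, for example http://localhost:8983/solr/test/select?q=name:Ch*&fq=entity_type:tags&wt=json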

I want to use multiple datasources in DataImportHandler in Solr and pass a URL value to a child entity after querying the database in the parent entity

I want to use multiple datasources in DataImportHandler in Solr and pass a URL value to a child entity after querying the database in the parent entity.
Here is my rss-data-config file:
<dataConfig>
<dataSource type="JdbcDataSource" name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/HCDACoreDB"
user="root" password="CDA#318"/>
<dataSource type="URLDataSource" name="ds-url"/>
<document>
<entity name="feeds" query="select f.feedurl, f.feedsource, c.categoryname from feeds f, category c where f.feedcategory = c.categoryid">
<field column="feedurl" name="url" dataSource="ds-db"/>
<field column="categoryname" name="category" dataSource="ds-db"/>
<field column="feedsource" name="source" dataSource="ds-db"/>
<entity name="rss"
transformer="HTMLStripTransformer"
forEach="/RDF/channel | /RDF/item"
processor="XPathEntityProcessor"
url="${dataimporter.functions.encodeUrl(feeds.feedurl)}" >
<field column="source-link" dataSource="ds-url" xpath="/rss/channel/link" commonField="true" />
<field column="Source-desc" dataSource="ds-url" xpath="/rss/channel/description" commonField="true" />
<field column="title" dataSource="ds-url" xpath="/rss/channel/item/title" />
<field column="link" dataSource="ds-url" xpath="/rss/channel/item/link" />
<field column="description" dataSource="ds-url" xpath="/rss/channel/item/description" stripHTML="true"/>
<field column="pubDate" dataSource="ds-url" xpath="/rss/channel/item/pubDate" />
<field column="guid" dataSource="ds-url" xpath="/rss/channel/item/guid" />
<field column="content" dataSource="ds-url" xpath="/rss/channel/item/content" />
<field column="author" dataSource="ds-url" xpath="/rss/channel/item/creator" />
</entity>
</entity>
</document>
</dataConfig>
What I am doing is this: in the first entity, named feeds, I query the database, and I want to use feedurl as the URL for the child entity named rss.
The error I get when I run the dataimport is:
java.net.MalformedURLException: no protocol: nullselect f.feedurl, f.feedsource, c.categoryname from feeds f, category c where f.feedcategory = c.categoryid
The URL is null, meaning feedurl is not being assigned to the URL.
Any suggestions on what I am doing wrong?
Here's an example:
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
<dataSource name="db1" ... />
<dataSource name="db2"... />
<document>
<entity name="outer" dataSource="db1" query=" ... ">
<field column="id" />
<entity name="inner" dataSource="db2" query=" select from ... where id = ${outer.id} ">
<field column="innercolumn" splitBy=":::" />
</entity>
</entity>
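</document>
</dataConfig>
The idea is to nest one entity definition inside the other; the inner entity runs the extra query against the other data source. Note that dataSource is set on each entity, not on its fields.
You can access the parent entity's fields like this: ${outer.id}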
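Applied to the config from the question, that could look roughly like the sketch below. Two things in it are assumptions rather than part of the original: forEach is changed to /rss/channel/item so that it matches the xpath expressions, and the url uses plain ${feeds.feedurl}, since URL-encoding a complete URL would also escape its "://". The password is masked and only a few fields are shown.

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" name="ds-db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/HCDACoreDB" user="root" password="****" />
  <dataSource type="URLDataSource" name="ds-url" />
  <document>
    <!-- dataSource is set on the entity, so the SQL is sent to the JDBC source. -->
    <entity name="feeds" dataSource="ds-db"
            query="select f.feedurl, f.feedsource, c.categoryname from feeds f, category c where f.feedcategory = c.categoryid">
      <field column="feedurl" name="url" />
      <field column="categoryname" name="category" />
      <field column="feedsource" name="source" />
      <!-- The child entity reads from the URL source; its url comes from the parent row. -->
      <entity name="rss" dataSource="ds-url"
              processor="XPathEntityProcessor"
              transformer="HTMLStripTransformer"
              forEach="/rss/channel/item"
              url="${feeds.feedurl}">
        <!-- forEach above is an assumption chosen to match these xpaths. -->
        <field column="title" xpath="/rss/channel/item/title" />
        <field column="link" xpath="/rss/channel/item/link" />
        <field column="description" xpath="/rss/channel/item/description" stripHTML="true" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

The error in the question suggests the SQL text was being handed to the URL data source, which is what you would expect if the feeds entity does not name ds-db itself.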

splitting multivalued field while importing data into solr

I'm having a bit of trouble getting my head around Solr 3.4 when it comes to multiple values. I have this DIH config:
<dataConfig>
<dataSource type="JdbcDataSource" name="********" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/lokal" user="****" password="******" />
<document>
<entity name="Search" transformer="RegexTransformer" query="select b_id, b_navn, b_cats, b_info, b_keyword, b_critical, b_geo, b_adress from searchbiz">
<field column="b_id" name="b_id" />
<field column="b_info" name="b_info" />
<field column="b_cats" name="b_cats" splitBy=","/>
</entity>
</document>
</dataConfig>
Now, my problem is that when b_cats is indexed, I get this result:
<arr name="b_adress">
<str>place1, place2</str>
</arr>
But I thought there should be one node for each value.
When I try to facet on this field, I get "place1, place2" = xx as the result, instead of place1 = xx and place2 = xx.
Can anybody please point me in the right direction on this problem?
Thanks ;)
Here is the solution:
<dataConfig>
<dataSource type="JdbcDataSource" name="********" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/lokal" user="****" password="******" />
<document>
<entity name="Search" transformer="RegexTransformer" query="select b_id, b_navn, b_cats, b_info, b_keyword, b_critical, b_geo, b_adress from searchbiz">
<field column="b_id" name="b_id" />
<field column="b_info" name="b_info" />
<field column="b_cats" splitBy="," sourceColName="b_cats"/>
</entity>
</document>
</dataConfig>
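Two further points seem worth checking. splitBy only produces separate values if the target field is multi-valued in schema.xml, and the output shown in the question is for b_adress, which has no splitBy in either config. A sketch assuming both fields should be split and are plain strings (the field types are an assumption):

```xml
<!-- schema.xml: the target fields must accept multiple values -->
<field name="b_cats" type="string" indexed="true" stored="true" multiValued="true" />
<field name="b_adress" type="string" indexed="true" stored="true" multiValued="true" />

<!-- data-config.xml: inside the entity that declares transformer="RegexTransformer" -->
<!-- splitBy is treated as a regular expression; ",\s*" also drops the space after each comma -->
<field column="b_cats" splitBy=",\s*" />
<field column="b_adress" splitBy=",\s*" />
```

With the space removed, faceting should count place1 and place2 as separate values rather than "place1" and " place2".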

DynamicField names from SQL value

I have this "catch all" field in my schema.xml:
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
In the example below, let's say I have a table with two fields, "custom_key" and "custom_value", holding these values:
custom_key: "mykey"
custom_value: "myvalue"
My goal is to index a document that has a field called "mykey" with the value "myvalue". How can I do that?
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/MY_DB"
user="MYUSER"
password="MYPASS"
batchSize="-1"/>
<document>
<entity name="article" query="SELECT id, custom_key, custom_value FROM mytable">
<field column="id" name="id"/>
<field column="custom_value" name=":::WHAT TO PUT HERE?:::_s"/>
</entity>
</document>
</dataConfig>
I found a (hacky?) solution that works for my purposes. I will not mark this question as answered for a few days, in case someone comes up with a cleaner or better solution.
<dataConfig>
<script><![CDATA[
function insertVariants(row) {
row.put(row.get('custom_key') + '_custom', row.get('custom_value'));
return row;
}
]]></script>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/MY_DB"
user="MYUSER"
password="MYPASS"
batchSize="-1"/>
<document>
<entity name="article" query="SELECT id, custom_key, custom_value FROM mytable" transformer="script:insertVariants">
<field column="id" name="id"/>
</entity>
</document>
</dataConfig>
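The script above builds field names ending in _custom, which would need a matching dynamicField of its own. A variant that targets the *_s dynamic field from the question might look like this; it is a sketch meant to replace only the script element, and the null check is an added assumption about rows that have no key:

```xml
<script><![CDATA[
  function insertVariants(row) {
    var key = row.get('custom_key');
    // Skip rows without a key so that no field named "null_s" is created.
    if (key != null) {
      // e.g. custom_key "mykey" becomes field "mykey_s", matched by *_s
      row.put(key + '_s', row.get('custom_value'));
    }
    return row;
  }
]]></script>
```

With the sample row from the question this produces a field named mykey_s holding "myvalue"; the bare name mykey would only be accepted if the schema declares it explicitly.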
