Let me preface this by saying that I've been through everything I could find on this topic, including the Solr docs and all of the SO questions.
I have a Solr instance that I've set up with a Data Import Handler (DIH) to pull in data from MSSQL using the JDBC driver. The data comes in, but it isn't structured as I'd expect based on the Solr DIH documentation:
<document>
<entity>
<entity />
</entity>
</document>
I've tried all the attributes (rootEntity, flatten, using CachedSqlProvider, etc.). With multiValued="true", the result ends up as
docs [
  {
    recordId: '1234',
    name: 'whatever',
    subrows_col1: ['x','y','z'],
    subrows_col2: ['a','b','c']
  }
]
when what I'm looking for is
docs [
  {
    recordId: '1234',
    name: 'whatever',
    subrows: [
      { col1: 'x', col2: 'a' },
      { col1: 'y', col2: 'b' },
      { col1: 'z', col2: 'c' }
    ]
  }
]
I've seen the block-join stuff, but I'm confused as to where it goes. I added
<add>
  <doc>
    <field />
    <doc>
      <field />
    </doc>
  </doc>
</add>
to the DIH requestHandler, but it did nothing. I added it to the /update requestHandler and I got an error. I have no clue where that is supposed to go. Does it only work during a query, or is it only for when you push data to Solr via /update?
Where do I define the structure for the document? I tried nested fields in the schema, entities in the DIH config, and the block-join stuff in the requestHandlers. Nothing has worked yet.
Obviously I'm missing something.
Indexing nested documents with DIH is finally supported from Solr 5.1 onwards.
https://issues.apache.org/jira/browse/SOLR-5147
Simply add child="true" to the child entity and DIH will automagically index it as a child document.
Example taken from the JIRA issue linked above:
<document>
  <entity name='PARENT' query='select * from PARENT'>
    <field column='id' />
    <field column='desc' />
    <field column='type_s' />
    <entity child='true' name='CHILD' query="select * from CHILD where parent_id='${PARENT.id}'">
      <field column='id' />
      <field column='desc' />
      <field column='type_s' />
    </entity>
  </entity>
</document>
I've also decompiled DocBuilder.class in solr-dataimporthandler-5.3.0.jar and found this code snippet:
if (doc != null) {
    if (epw.getEntity().isChild()) {
        childDoc = new DocWrapper();
        handleSpecialCommands(arow, childDoc);
        addFields(epw.getEntity(), childDoc, arow, vr);
        doc.addChildDocument(childDoc);
    } else {
        handleSpecialCommands(arow, doc);
        addFields(epw.getEntity(), doc, arow, vr);
    }
}
Notice that epw.getEntity().isChild() returns true if child="true" is set, so it creates a new DocWrapper and adds it as a child document instead of simply adding the entity's columns as a bunch of new fields on the parent document.
DIH does not produce nested documents. Solr supports them, but DIH can't yet generate them.
Nested entities in DIH exist to merge sources and to create documents by iterating over a different source, e.g. the outer entity reads a directory for file names and the inner entity loads the content of each of those files, with each file getting its own record.
You may want to move your nested object code into the client with SolrJ for now.
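For instance, here's a rough SolrJ sketch of indexing one parent with its sub-rows as child documents. The core URL, the child id scheme, and the col1/col2 field names are assumptions based on the structure described in the question, and it uses the newer HttpSolrClient API:
import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NestedIndexExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; adjust to your instance
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "1234");
        parent.addField("name", "whatever");

        // One child document per joined sub-row
        List<String[]> subrows = Arrays.asList(
                new String[]{"x", "a"}, new String[]{"y", "b"}, new String[]{"z", "c"});
        int n = 0;
        for (String[] row : subrows) {
            SolrInputDocument child = new SolrInputDocument();
            child.addField("id", "1234-" + (++n)); // children need their own unique ids
            child.addField("col1", row[0]);
            child.addField("col2", row[1]);
            parent.addChildDocument(child);
        }

        solr.add(parent); // the parent and its children are indexed together as one block
        solr.commit();
        solr.close();
    }
}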
Related
TLDR
How do I configure the Solr Data Import Handler so it will import HTML similar to Solr's "post" utility?
Context
We're doing a small project where code will export a set of pages from wiki/Confluence to 'straight HTML' (for availability in a DR data center; straight HTML pages will not depend on a database, etc.).
We want to index the html pages in solr.
We "have it working" using the solr-shipped "post utility"
post -c OPERATIONS -recursive -0 -host solr $(find . -name '*.html')
This is fine. However, we would like to leverage the Data Import Handler (DIH), i.e. replace the shell command with a single HTTP call to the DIH endpoint ('/dataimport').
Question
How do I configure the Tika "data config xml" file to get "similar functionality" to the Solr "post" command?
When I configure with data-config.xml, the Solr document only ends up with "id" and "_version_" fields (where id is the untokenized file name).
Correction: I had originally written '"id" and "title" fields...'
"id":"database_operations_2019.html",
"_version_":1650836000296927232},
However, when I use "bin/post", the document has these fields, including the tokenized title:
"id":"/usr/local/html/OPERATIONS_2019_1119_1500/./database_operations_2019.html",
"stream_size":[54115],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"dc_title":["Database Operations 2019 Guidebook"],
"content_encoding":["UTF-8"],
"content_type_hint":["text/html; charset=UTF-8"],
"resourcename":["/usr/local/html/OPERATIONS_2019_1119_1500/./database_operations_2019.html"],
"title":["Database Operations 2019 Guidebook"],
"content_type":["text/html; charset=UTF-8"],
"_version_":1650834641083432960},
Some Points
I've tried RTM'ing, but I don't follow how a "field" maps to the HTML body.
Parsing a directory full of HTML is a circa-1999 problem, so I don't expect a lot of people to be working on this.
I've looked at SimplePostTool.java (the implementation of bin/post)... no real answer.
Data Config Xml File
<dataConfig>
<dataSource type="BinFileDataSource"/>
<document>
<entity name="file" processor="FileListEntityProcessor"
dataSource="null"
htmlMapper="true"
format="html"
baseDir="/usr/local/var/www/confluence/OPERATIONS"
fileName=".*html"
rootEntity="false">
<field column="file" name="id"/>
<entity name="html" processor="TikaEntityProcessor"
url="${file.fileAbsolutePath}" format="text">
<field column="title" name="title" meta="true"/>
<field column="dc:format" name="format" meta="true"/>
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>
I ended up writing a few lines of code to parse the HTML files (with jsoup) and ditched the Solr Data Import Handler (DIH).
Very straightforward using Spring, Solr, and the jsoup HTML parser.
One caveat: my Java "bean" object that stores the Solr fields needed a "text" field for the out-of-the-box default search field to work (i.e. with the Solr docker instance).
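For reference, a stripped-down sketch of that jsoup + SolrJ approach; the core name, directory, and field names are assumptions rather than the actual project code:
import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/OPERATIONS").build();

        File dir = new File("/usr/local/var/www/confluence/OPERATIONS");
        for (File html : dir.listFiles((d, name) -> name.endsWith(".html"))) {
            Document page = Jsoup.parse(html, "UTF-8");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", html.getName());
            doc.addField("title", page.title());
            doc.addField("text", page.body().text()); // "text" feeds the default search field mentioned above
            solr.add(doc);
        }

        solr.commit();
        solr.close();
    }
}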
This question is similar to Solr doesn't overwrite - duplicated uniqueKey entries, but I am in a situation where I have a large body of existing documents that have already been added to the collection with no child documents, and I am using (standalone not cloud) Solr 6.4 rather than 5.3.1. We recently enabled child documents so that we could store richer data.
We use SolrJ to load data into and query Solr, but to isolate the issue we're seeing, I used the command line Solr post tool to upload the following document:
<add>
<doc>
<field name="id">1</field>
<field name="solr_record_type">1</field>
<field name="title">Fabulous Book</field>
<field name="author">Angelo Author</field>
</doc>
</add>
Search results were as expected:
Using q=id:1 and
fl=id,title,index_date,[child parentFilter="solr_record_type:1"]
"response":{"numFound":1,"start":0,"docs":[
{
"id":"1",
"title":"Fabulous Book",
"index_date":"2019-01-16T23:06:57.221Z"}]
}
Then I updated the document by posting the following:
<add>
  <doc>
    <field name="id">1</field>
    <field name="solr_record_type">1</field>
    <field name="title">Fabulous Book</field>
    <field name="author">Angelo Author</field>
    <doc>
      <field name="id">1-1</field>
      <field name="solr_record_type">2</field>
      <field name="contributor_name">Polly Math</field>
      <field name="contributor_type">3</field>
    </doc>
  </doc>
</add>
Then, repeating my search, I got the following duplicate result, searching on the unique id field, which is undesirable.
"response":{"numFound":2,"start":0,"docs":[
{
"id":"1",
"title":"Fabulous Book",
"index_date":"2019-01-16T23:06:57.221Z",
"_childDocuments_":[
{
"id":"1-1",
"solr_record_type":2,
"contributor_name":"Polly Math",
"contributor_type":3,
"index_date":"2019-01-16T23:09:29.142Z"}]},
{
"id":"1",
"title":"Fabulous Book",
"index_date":"2019-01-16T23:09:29.142Z",
"_childDocuments_":[
{
"id":"1-1",
"solr_record_type":2,
"contributor_name":"Polly Math",
"contributor_type":3,
"index_date":"2019-01-16T23:09:29.142Z"}]}]
}
Going the other way, if I start with a document that was loaded initially with a child document, like the following:
<add>
  <doc>
    <field name="id">2</field>
    <field name="solr_record_type">1</field>
    <field name="title">Wonderful Book</field>
    <field name="author">Andy Author</field>
    <doc>
      <field name="id">2-1</field>
      <field name="solr_record_type">2</field>
      <field name="contributor_name">Polly Math</field>
      <field name="contributor_type">3</field>
    </doc>
  </doc>
</add>
And then I update it with a document with no children:
<add>
<doc>
<field name="id">2</field>
<field name="solr_record_type">1</field>
<field name="title">Wonderful Book</field>
<field name="author">Andy Author</field>
</doc>
</add>
The result still has the child:
"response":{"numFound":1,"start":0,"docs":[
{
"id":"2",
"title":"Wonderful Book",
"index_date":"2019-01-16T23:09:39.389Z",
"_childDocuments_":[
{
"id":"2-1",
"title_id":2,
"title_instance_id":2,
"solr_record_type":2,
"contributor_name":"Polly Math",
"contributor_type":3,
"index_date":"2019-01-16T23:07:04.861Z"}]}]
}
This is strange because if I update a document with 2 child documents with a replacement document with only 1 child document, it does drop one child document. But in this case, it is not dropping the child document.
Updates of documents with no child documents that don't add child documents, and updates of documents with child documents that don't remove all child documents both seem to work as I'd expect.
I have a large body of existing documents that don't have children, which I may be adding children to, and eventually I may have a lot of child-having documents that might drop their children. Given that, what is the best way to update these records without generating duplicate records or losing updates?
I would strongly advise avoiding Solr parent/child relationships. We decided to use them in Solr 5.3.1, and it turns out that although much of the functionality is there, there are a number of nasty bugs present in Solr since 4.x that remain unfixed, including:
SOLR-6096: Support Update and Delete on nested documents
SOLR-5211: updating parent as childless makes old children orphans (UPDATE: fixed in 8.0)
SOLR-6596: Atomic update and adding child doc not working together
SOLR-5772: duplicate documents between solr "block join" documents and "normal" document
SOLR-10030: SolrClient.getById() method in Solrj doesn't retrieve child documents
For those reasons, if at all possible, I strongly recommend AVOIDING child documents. Even if those issues don't hit you now, they will at some point, and it's clear, given that they have not been fixed in 3 to 4 major versions, that there is no real support in the product for child documents. Sorry to be the bearer of bad news, but hopefully someone can learn from our experience.
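That said, if you're stuck with child documents on an affected version, one workaround people use for the duplicate/orphan problems is to delete the whole old block explicitly before re-adding the parent with its current set of children. This is only a sketch and assumes your schema defines the _root_ field, which block indexing populates with the parent's id; the core URL is hypothetical:
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ReplaceBlock {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // Remove the old parent and any children it had (_root_ holds the parent's id for block-indexed docs)
        solr.deleteByQuery("id:2 OR _root_:2");

        // Re-add the parent with whatever children it should have now (none, in this example)
        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "2");
        parent.addField("solr_record_type", "1");
        parent.addField("title", "Wonderful Book");
        parent.addField("author", "Andy Author");
        solr.add(parent);

        solr.commit();
        solr.close();
    }
}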
I am trying to implement delta-import in Solr indexing. It's working fine when I am indexing data from a database, but I want to implement it on a file-based datasource.
My data-config.xml file looks like this:
<dataSource type="com.solr.datasource.DataSource" name="SuggestionsFile"/>
<document name="suggester">
  <entity name="file" dataSource="SuggestionsFile">
    <field column="suggestion" name="suggestion" />
  </entity>
</document>
and I am using the DataImportHandler in the solrconfig.xml file. (I am not able to post my full config file; I tried to post it, but I don't know why it's not showing.)
My DataSource class reads the text file and returns a list of data that Solr indexes. It's working fine with full-import but not with delta-import. Please suggest what else I need to do.
The FileListEntityProcessor supports filtering the file list based on the "newerThan" attribute:
<entity
name="fileimport"
processor="FileListEntityProcessor"
newerThan="${dataimporter.last_index_time}"
.. other options ..
>
...
</entity>
There's a complete example available online.
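With FileListEntityProcessor there is no real delta query; the usual pattern is to re-run a full import without cleaning the index and let newerThan skip unchanged files, e.g. (core name assumed):
http://localhost:8983/solr/suggester/dataimport?command=full-import&clean=false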
I am using Solr for searching my corpus of web page data. My solr-indexer will create several fields and corresponding values. However, some of these fields I want to update more often, for example the number of clicks on that page. These fields need not be indexed and I don't need to search on their values, but I do want to fetch them and update them often.
I am a newbie with Solr, so a more descriptive answer, perhaps with some running example code, would help me.
If you are on Solr 4+, yes, you can push a partial update to the Solr index.
For partial updates, all fields in your schema.xml need to be stored.
This is how your fields section should look:
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true" />
<field name="body" type="text_general" indexed="true" stored="true"/>
<field name="clicks" type="integer" indexed="true" stored="true" />
</fields>
Now when you send a partial update to one of the fields, e.g. in your case "clicks", Solr will in the background fetch the values of all other fields for that document (title, description, body), delete the old document, and push the new, updated document to the Solr index.
curl 'localhost:8080/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","clicks":{"set":100}}]'
Here is a good documentation on partial updates: http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Sample Solr partial update code:
Prerequisites: The fields need to be stored.
You need to configure the update log path under the direct update handler:
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Enables a transaction log, used for real-time get, durability, and
       SolrCloud replica recovery. The log can grow as big as
       uncommitted changes to the index, so use of a hard autoCommit
       is recommended (see below).
       "dir" - the target directory for transaction logs, defaults to the
       solr data directory. -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>
Code:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PartialUpdate {
    public static void main(String[] args) throws SolrServerException, IOException {
        SolrServer server = new HttpSolrServer("http://localhost:8080/solr");

        SolrInputDocument doc = new SolrInputDocument();
        Map<String, String> partialUpdate = new HashMap<String, String>();
        // set - to set a field.
        // add - to add to a multi-valued field.
        // inc - to increment a field.
        partialUpdate.put("set", "peter");    // value that needs to be set
        doc.addField("id", "122344545");      // unique id
        doc.addField("fname", partialUpdate); // the fname field of document 122344545 will be set to 'peter'

        server.add(doc);
    }
}
I don't know Java, I don't know XML, and I don't know Lucene. Now that that's out of the way: I have been working to create a little project using Apache Solr/Lucene. My problem is that I am unable to index the XML files. I think I understand how it's supposed to work, but I could be wrong. I am not sure what information is required for you to help me, so I will just post the code.
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<!-- This first entity block will read all xml files in baseDir and feed it into the second entity block for handling. -->
<entity name="AMMFdir" rootEntity="false" dataSource="null"
processor="FileListEntityProcessor"
fileName="^*\.xml$" recursive="true"
baseDir="C:\Documents and Settings\saperez\Desktop\Tomcat\apache-tomcat-7.0.23\webapps\solr\data\AMMF_New"
>
<entity
processor="XPathEntityProcessor"
name="AMMF"
pk="AcquirerBID"
datasource="AMMFdir"
url="${AMMFdir.fileAbsolutePath}"
forEach="/AMMF/Merchants/Merchant/"
transformer="DateFormatTransformer, RegexTransformer"
>
<field column="AcquirerBID" xpath="/AMMF/Merchants/Merchant/AcquirerBID" />
<field column="AcquirerName" xpath="/AMMF/Merchants/Merchant/AcquirerName" />
<field column="AcquirerMerchantID" xpath="/AMMF/Merchants/Merchant/AcquirerMerchantID" />
</entity>
</entity>
</document>
</dataConfig>
Example xml file
<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
<Merchants Count="153">
<Merchant ChangeIndicator="A" LocationCountry="840">
<AcquirerBID>10029881</AcquirerBID>
<AcquirerName>WorldPay</AcquirerName>
<AcquirerMerchantID>*</AcquirerMerchantID>
<Merchant ChangeIndicator="A" LocationCountry="840">
<AcquirerBID>10029882</AcquirerBID>
<AcquirerName>WorldPay2</AcquirerName>
<AcquirerMerchantID>Hello World!</AcquirerMerchantID>
</Merchant>
</Merchants>
I have this in the schema:
<field name="AcquirerBID" type="string" indexed="true" stored="true" required="true" />
<field name="AcquirerName" type="string" indexed="true" stored="true" />
<field name="AcquirerMerchantID" type="string" indexed="true" stored="true"/>
I have this in the config:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" default="true" >
<lst name="defaults">
<str name="config">AMMFconfig.xml</str>
</lst>
</requestHandler>
The sample XML is not well formed. This might explain errors indexing the files:
$ xmllint sample.xml
sample.xml:13: parser error : expected '>'
</Merchants>
^
sample.xml:14: parser error : Premature end of data in tag Merchants line 3
sample.xml:14: parser error : Premature end of data in tag AMMF line 2
Corrected XML
Here's what I think your sample data should look like (I didn't check the XSD file):
<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
<Merchants Count="153">
<Merchant ChangeIndicator="A" LocationCountry="840">
<AcquirerBID>10029881</AcquirerBID>
<AcquirerName>WorldPay</AcquirerName>
<AcquirerMerchantID>*</AcquirerMerchantID>
</Merchant>
<Merchant ChangeIndicator="A" LocationCountry="840">
<AcquirerBID>10029882</AcquirerBID>
<AcquirerName>WorldPay2</AcquirerName>
<AcquirerMerchantID>Hello World!</AcquirerMerchantID>
</Merchant>
</Merchants>
</AMMF>
Alternative solution
I know you said you're not a programmer, but this task is significantly simpler if you use the SolrJ interface.
The following is a Groovy example which indexes your example XML:
//
// Dependencies
// ============
@Grapes([
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0')
])
import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument
//
// Main
// =====
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/");

def i = 1
new File(".").eachFileMatch(~/.*\.xml/) {
    it.withReader { reader ->
        def ammf = new XmlSlurper().parse(reader)
        ammf.Merchants.Merchant.each { merchant ->
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i++)
            doc.addField("bid_s", merchant.AcquirerBID)
            doc.addField("name_s", merchant.AcquirerName)
            doc.addField("merchantId_s", merchant.AcquirerMerchantID)
            server.add(doc)
        }
    }
}
server.commit()
Groovy is a Java scripting language that does not require compilation. It would be just as easy to maintain as a DIH config file.
To figure out how DIH XML import works, I suggest you first carefully read this chapter in the DIH wiki: http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example.
Open the Slashdot link http://rss.slashdot.org/Slashdot/slashdot in your browser, then right-click on the page and select View Source. That's the XML file used in this example.
Compare it with XPathEntityProcessor configuration in DIH example and you'll see how easy it is to import any XML file in Solr.
If you need more help just ask...
Often the best thing to do is NOT use the DIH. How hard would it be to just post this data using the API and a custom script in a language you DO know?
The benefit of this approach is two-fold:
You learn more about your system, and know it better.
You don't spend time trying to understand the DIH.
The downside is that you're re-inventing the wheel a bit, but the DIH is quite a thing to understand.