How to store information in Solr?

I recently began to learn Solr. Some things are still unclear to me, so I'll explain what I'm trying to do; please tell me which way to go.
I need a web application in which it will be possible to save data, where some fields are plain text and some are files. Adding text fields is clear to me, but I can't add files, or their contents as text; and in that case, where do I store the file itself?
If I need to find a file and only know a couple of words from the entire file, I want the search to return all files containing those words. Should I add a separate database for this? If so, where do I store the files? If not, the same question.
I would really appreciate a link to an example showing how this is done.

This is far too broad and non-specific to give an answer you can just implement; in general you'd submit the documents together with an id to Solr (through Tika in the Extracting Request Handler / Solr Cell).
The documents themselves will have to be stored somewhere else, as Solr doesn't handle document storage for you. They can be stored on a cloud service, on a network drive or on a local disk - this will depend on your web application.
Your application will then receive the file from the user, store a database row assigning the file to the user, store the file somewhere (S3 / Google Cloud Storage / a local path) under a well-known name (usually the id of the row from the database) and submit the content to Solr for indexing - together with metadata (such as the user id) and the file id.
Searching will then give you the id back and you can retrieve the document from wherever you stored it.
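For reference, a minimal sketch of that indexing step with curl; the core name, file path and the extra literal field are placeholders/assumptions, not something prescribed by Solr:
# send the file through Solr Cell (Tika extraction); literal.* attaches your own
# metadata, here the database row id and an assumed user_id field
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=42&literal.user_id=7&commit=true" \
  -F "file=@/path/to/document.pdf"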

As MatsLindh already mentioned, that is the general approach to achieve what you are looking for.
Here are some steps by which you can index files whose location is known.
Update solrconfig.xml with the lines below:
<!-- Load Data Import Handler and Apache Tika (extraction) libraries -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar"/>
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar"/>

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>
Create a file named tika-data-config.xml under the G:\Solr\TikaConf\conf folder (this location may be different for you) with the configuration below:
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="file" processor="FileListEntityProcessor" dataSource="null"
            baseDir="G:/Solr/solr-7.7.2/example/exampledocs" fileName=".*xml"
            rootEntity="false">
      <field column="file" name="id"/>
      <entity name="pdf" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
Add the below field to your schema.xml:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="false"/>
Update the solrconfig.xml file as below in order to disable schemaless mode:
<!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"
                             processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
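As far as I know, the same property can also be set without editing solrconfig.xml by using the Config API; the core name here is a placeholder:
# disable schemaless field guessing by setting the user property
curl http://localhost:8983/solr/mycore/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'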
Go to the Solr admin page, select the core you created and click on Dataimport.
Once the data has been imported and indexed, you can verify it by querying the core.
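The same can be done from the command line instead of the admin UI; for example (core name assumed to be mycore):
# trigger a full import through the /dataimport handler configured above
curl "http://localhost:8983/solr/mycore/dataimport?command=full-import"
# check the import status
curl "http://localhost:8983/solr/mycore/dataimport?command=status"
# verify the indexed content
curl "http://localhost:8983/solr/mycore/select?q=text:someword"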
If your file location is dynamic - that is, you retrieve the file location from a database - then that database query becomes your first entity, retrieving the metadata about the files such as id, name, author and the file path. In the second entity, the TikaEntityProcessor, you pass the file path and get the content of the file indexed. A rough sketch of that setup follows below.
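This is only an untested sketch; the JDBC settings, table and column names are invented purely for illustration:
<dataConfig>
  <dataSource name="db" type="JdbcDataSource" driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/files" user="solr" password="secret"/>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <!-- first entity: file metadata (id, author, path, ...) from the database -->
    <entity name="meta" dataSource="db" query="SELECT id, author, file_path FROM documents">
      <field column="id" name="id"/>
      <field column="author" name="author"/>
      <!-- second entity: hand the stored path to Tika and index the extracted text -->
      <entity name="content" processor="TikaEntityProcessor" dataSource="bin"
              url="${meta.file_path}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>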

Related

Importing latitude and longitude into a location (LatLonPointSpatialField class) field in Solr

Alright, I am looking for general guidelines on how to import a CSV file containing the following fields
poi_name, latitude, longitude
into a Solr (7.x) core to perform geo queries. What is the right way to achieve this? I tried:
using bin/post, but the import creates a useless schema where all the fields are multivalued. Obviously no location field is being created.
doing the same but creating a schema for the 3 fields via the admin UI first, and I get "Document is missing mandatory uniqueKey field: id". I would like the id to be automatically populated with a random UUID.
and lastly, and most important, how to "compute" a LatLonPointSpatialField from the latitude and longitude. Via the UI there was no way to create a fourth field that utilizes other fields.
Do I really need to go through the trouble of defining a DataImportHandler to do this, or is it sufficient to create a schema for all this?
What if the latitude and longitude are already there and I am trying to add the location field to the schema at a later time?
I can't find a good example for doing this; however, there is an old example where the location field is automatically composed if latitude and longitude have predefined names with a suffix, something like location_1_coordinate and location_2_coordinate - this seems silly!
Just to conclude and aggregate the answer for anyone interested, this is the solution I came to following MatsLindh's suggestion. Context: CentOS 7 and Solr 7.5.
Sample.csv content
name,lon,lat,
A,22.9308852,39.3724824
B,22.5094530,40.2725792
relevant portion of the schema (managed-schema file)
<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>
...
<field name="lat" type="string" omitTermFreqAndPositions="true" indexed="true" required="true" stored="true"/>
<field name="location" type="location" multiValued="false" stored="true"/>
<field name="lon" type="string" omitTermFreqAndPositions="true" indexed="true" stored="true"/>
solrconfig.xml
<updateRequestProcessorChain name="uuid-location">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">lat</str>
<str name="dest">location</str>
</processor>
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">lon</str>
<str name="dest">location</str>
</processor>
<processor class="solr.ConcatFieldUpdateProcessorFactory">
<str name="fieldName">location</str>
<str name="delimiter">,</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
<lst name="defaults">
<str name="df">_text_</str>
<str name="update.chain">uuid-location</str>
</lst>
</initParams>
and to import the sample file into the core, run the following in bash:
/opt/solr/bin/post -c your_core_name /opt/solr/sample.csv
And if you wonder how to query that data, use
http://localhost:8983/solr/your_core_name/select?&q=*:*&fq={!geofilt%20sfield=location}&pt=42.27,-74.91&d=1
where pt is the lat,lon point and d is the distance in kilometers.
First - you'll have to define a location field. The schemaless mode is made for quick prototyping; if you need more specific fields (and want to be sure that the fields get the correct type in production), you'll have to configure them explicitly. Use the LatLonPointSpatialField type for this, and make it single valued.
First define the field type to use (these are adapted from the Schema API documentation):
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type":{
    "name":"location_type",
    "class":"solr.LatLonPointSpatialField"
  }
}' http://localhost:8983/solr/gettingstarted/schema
Then add a field with that type:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"location",
"type":"location_type",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema
The two other issues can be fixed through a custom update chain (you provide the name of the chain as the update.chain URL parameter when indexing the document).
To automagically assign a guid to any indexed document, you can use the UUIDUpdateProcessorFactory. Give the field name (id) as the fieldName parameter.
To get the latitude and longitude concatenated into a single field with , as the separator, you can use a ConcatFieldUpdateProcessorFactory. The important thing here is that it concatenates a list of values given for a single valued field into a single value - it does not concatenate two different fields by name. To fix that we can use a CloneFieldUpdateProcessorFactory to copy both the latitude and the longitude value into a separate field.
<updateRequestProcessorChain name="populate-location">
<processor class="solr.CloneFieldUpdateProcessorFactory">
<arr name="source">
<str>latitude</str>
<str>longitude</str>
</arr>
<str name="dest">location</str>
</processor>
<processor class="solr.ConcatFieldUpdateProcessorFactory">
<str name="delimiter">,</str>
</processor>
</updateRequestProcessorChain
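Indexing through that chain could then look roughly like this; the core name and the id value are placeholders, and the latitude and longitude values should end up concatenated as "lat,lon" in the location field:
# post one document through the populate-location chain defined above
curl "http://localhost:8983/solr/mycore/update?update.chain=populate-location&commit=true" \
  -H 'Content-Type: application/json' \
  -d '[{"id":"1","latitude":"39.3724824","longitude":"22.9308852"}]'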
If you add the location field later and already have the data in your database, this won't work. Solr won't touch data that has already been indexed, and you'll have to reindex to get your information processed and indexed the correct way. This is true regardless of how you get content into the location field.
The old example is probably the other way around - earlier you'd send a latlon pair, and it'd get indexed as two separate values - one for latitude and one for longitude - under the hood. You could probably hack around that by sending a single value for each, but it was really meant to work the other way around - sending one value and getting it indexed as two separate fields. Since the geospatial support in Lucene (and Solr) was just starting out, the already existing types were re-used instead of creating more dedicated types.

How does SOLR Cell add document content?

SOLR has a module called Cell. It uses Tika to extract content from documents and index it with SOLR.
From the sources at https://github.com/apache/lucene-solr/tree/master/solr/contrib/extraction, I conclude that Cell places the raw extracted document text into a field called "content". The field is indexed by SOLR, but not stored. When you query for documents, "content" doesn't come up.
My SOLR instance has no schema (I left the default schema in place).
I'm trying to implement a similar kind of behavior using the default UpdateRequestHandler (POST to /solr/corename/update). The POST request goes:
<add commitWithin="60000">
<doc>
<field name="content">lorem ipsum</field>
<field name="id">123456</field>
<field name="someotherfield_i">17</field>
</doc>
</add>
With documents added in this manner, the content field is indexed and stored. It's present in query results. I don't want it to be; it's a waste of space.
What am I missing about the way Cell adds documents?
If you don't want your field to store the contents, you have to set the field as stored="false".
Since you're using the schemaless mode (there still is a schema, it's just generated dynamically when new fields are added), you'll have to use the Schema API to change the field.
You can do this by issuing a replace-field command:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field":{
    "name":"content",
    "type":"text_general",
    "stored":false }
}' http://localhost:8983/solr/collection/schema
You can see the defined fields by issuing a request against /collection/schema/fields.
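For example (with collection replaced by your collection name):
# list the currently defined fields and their flags
curl "http://localhost:8983/solr/collection/schema/fields"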
The Cell code indeed adds the content to the document as content, but there's a built-in field translation rule that replaces content with _text_. In the schemaless SOLR, _text_ is marked as not for storing.
The rule is invoked by the following line in the SolrContentHandler.addField():
String name = findMappedName(fname);
In the params object there's a rule that the content field should be mapped to _text_ (the fmap.content parameter). It comes from corename\conf\solrconfig.xml, where by default there's the following fragment:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str> <!-- This one! -->
</lst>
</requestHandler>
Meanwhile, in corename\conf\managed_schema there's a line:
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
And that's the whole story.
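So if you want to mimic Cell's behaviour through the plain /update handler, one option (a sketch based on the default schemaless configset described above) is to send the extracted text into _text_ instead of content:
<add commitWithin="60000">
  <doc>
    <field name="id">123456</field>
    <field name="_text_">lorem ipsum</field>
    <field name="someotherfield_i">17</field>
  </doc>
</add>
The text is then searchable, but because _text_ is stored="false" it won't show up in query results.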

How to handle Solr delta-import with a file-based datasource

I am trying to implement delta-import in Solr indexing and it's working fine when I am indexing data from a database. But I want to implement it on a file-based datasource.
My data-config.xml file looks like this:
<dataSource type="com.solr.datasource.DataSource" name="SuggestionsFile"/>
<document name="suggester">
  <entity name="file" dataSource="SuggestionsFile">
    <field column="suggestion" name="suggestion" />
  </entity>
</document>
and I am using the DataImportHandler in the solrconfig.xml file. I am not able to post my full config file; I tried to, but I don't know why it's not showing.
My DataSource class reads the text file and returns a list of data that Solr indexes. It's working fine in the case of full-import but not in the case of delta-import. Please suggest what else I need to do.
The FileListEntityProcessor supports filtering the list based on the "newerThan" attribute:
<entity
name="fileimport"
processor="FileListEntityProcessor"
newerThan="${dataimporter.last_index_time}"
.. other options ..
>
...
</entity>
There's a complete example available online.

Solr DataImportHandler - indexing multiple, related XML documents

Let's say I have two XML document types, A and B, that look like this:
A:
<xml>
  <a>
    <name>First Number</name>
    <num>1</num>
  </a>
  <a>
    <name>Second Number</name>
    <num>2</num>
  </a>
</xml>
B:
<xml>
  <b>
    <aKey>1</aKey>
    <value>one</value>
  </b>
  <b>
    <aKey>2</aKey>
    <value>two</value>
  </b>
</xml>
I'd like to index it like this:
<doc>
  <str name="name">First Number</str>
  <int name="num">1</int>
  <str name="spoken">one</str>
</doc>
<doc>
  <str name="name">Second Number</str>
  <int name="num">2</int>
  <str name="spoken">two</str>
</doc>
So, in effect, I'm trying to use a value from A as a key in B. Using DataImportHandler, I've used the following as my data config definition:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="document" transformer="LogTransformer" logLevel="trace"
            processor="FileListEntityProcessor" baseDir="/tmp/somedir"
            fileName="A.*.xml$" recursive="false" rootEntity="false"
            dataSource="null">
      <entity name="a"
              transformer="RegexTransformer,TemplateTransformer,LogTransformer"
              logLevel="trace" processor="XPathEntityProcessor" url="${document.fileAbsolutePath}"
              stream="true" rootEntity="true" forEach="/xml/a">
        <field column="name" xpath="/xml/a/name" />
        <field column="num" xpath="/xml/a/num" />
        <entity name="b" transformer="LogTransformer"
                processor="XPathEntityProcessor" url="/tmp/somedir/b.xml"
                stream="false" forEach="/xml/b" logLevel="trace">
          <field column="spoken" xpath="/xml/b/value[../aKey=${a.num}]" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
However, I encounter two problems:
I can't get the XPath expression with the predicate to match any rows, regardless of whether I use an alternative like /xml/b[aKey=${a.num}]/value, or even a hardcoded value for aKey.
Even when I remove the predicate, the parser goes through the B file once for every row in A, which is obviously inefficient.
My question is: how, in light of the problems listed above, do I index the data correctly and efficiently with the DataImportHandler?
I'm using Solr 3.6.2.
Note: This is a bit similar to this question, but it deals with two XML document types instead of a RDBMS and an XML document.
I have very bad experiences using the DataImportHandler for that kind of data. A simple Python script to merge your data would probably be smaller than your current configuration and much more readable (see the sketch below). Depending on your requirements and data size, you could create a temporary XML file or you could directly pipe the results to SOLR. If you really have to use the DataImportHandler, you could use a URLDataSource and set up a minimal server which generates your XML. Obviously I'm a Python fan, but it's quite likely that it's also an easy job in Ruby, Perl, ...
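To illustrate that, here is a minimal, untested sketch of such a merge script using only the Python standard library. The file names (a.xml, b.xml), the field names and the Solr URL (a default single-core setup) are assumptions, not part of the original answer:
# merge A.xml and B.xml into Solr <add> documents and post them to /update
import urllib.request
import xml.etree.ElementTree as ET

# lookup table from B.xml: aKey -> spoken value
spoken = {b.findtext("aKey"): b.findtext("value")
          for b in ET.parse("b.xml").getroot().findall("b")}

# one <doc> per <a> element from A.xml, joined on num == aKey
add = ET.Element("add")
for a in ET.parse("a.xml").getroot().findall("a"):
    doc = ET.SubElement(add, "doc")
    for name in ("name", "num"):
        ET.SubElement(doc, "field", name=name).text = a.findtext(name)
    ET.SubElement(doc, "field", name="spoken").text = spoken.get(a.findtext("num"), "")

# post the merged XML update message to Solr and commit
req = urllib.request.Request(
    "http://localhost:8983/solr/update?commit=true",
    data=ET.tostring(add),
    headers={"Content-Type": "text/xml"},
)
urllib.request.urlopen(req)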
I finally went with another solution due to an additional design requirement I didn't originally mention. What follows is the explanation and discussion. So....
If you only have one or a couple of import flow types for your Solr instances:
Then it might be best to go with Achim's answer and develop your own importer - either, as Achim suggests, in your favorite scripting language, or, in Java, using SolrJ's
ConcurrentUpdateSolrServer.
This is because the DataImportHandler framework does have a sudden spike in its learning curve once you need to define more complex import flows.
If you have a nontrivial number of different import flows:
Then I would suggest you consider staying with the DataImportHandler since you will probably end up implementing something similar anyway. And, as the framework is quite modular and extendable, customization isn't a problem.
This is the additional requirement I mentioned, so in the end I went with that route.
How I solved my particular quandary was indexing the files I needed to reference into separate cores and using a modified SolrEntityProcessor to access that data. The modifications were as follows:
applying the patch for the sub-entity problem,
adding caching (quick solution using Guava, there's probably a better way using an available Solr API for accessing other cores locally, but I was in a bit of a hurry at that point).
If you don't want to create a new core for each file, an alternative would be an extension of Achim's idea, i.e. creating a custom EntityProcessor that would preload the data and enable querying it somehow.

Need help indexing XML files into Solr using DataImportHandler

I don't know Java, I don't know XML, and I don't know Lucene. Now that that's out of the way: I have been working to create a little project using Apache Solr/Lucene. My problem is that I am unable to index the XML files. I think I understand how it's supposed to work, but I could be wrong. I am not sure what information is required for you to help me, so I will just post the code.
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<!-- This first entity block will read all xml files in baseDir and feed it into the second entity block for handling. -->
<entity name="AMMFdir" rootEntity="false" dataSource="null"
processor="FileListEntityProcessor"
fileName="^*\.xml$" recursive="true"
baseDir="C:\Documents and Settings\saperez\Desktop\Tomcat\apache-tomcat-7.0.23\webapps\solr\data\AMMF_New"
>
<entity
processor="XPathEntityProcessor"
name="AMMF"
pk="AcquirerBID"
datasource="AMMFdir"
url="${AMMFdir.fileAbsolutePath}"
forEach="/AMMF/Merchants/Merchant/"
transformer="DateFormatTransformer, RegexTransformer"
>
<field column="AcquirerBID" xpath="/AMMF/Merchants/Merchant/AcquirerBID" />
<field column="AcquirerName" xpath="/AMMF/Merchants/Merchant/AcquirerName" />
<field column="AcquirerMerchantID" xpath="/AMMF/Merchants/Merchant/AcquirerMerchantID" />
</entity>
</entity>
</document>
</dataConfig>
Example xml file
<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
<Merchants Count="153">
<Merchant ChangeIndicator="A" LocationCountry="840">
<AcquirerBID>10029881</AcquirerBID>
<AcquirerName>WorldPay</AcquirerName>
<AcquirerMerchantID>*</AcquirerMerchantID>
<Merchant ChangeIndicator="A" LocationCountry="840">
<AcquirerBID>10029882</AcquirerBID>
<AcquirerName>WorldPay2</AcquirerName>
<AcquirerMerchantID>Hello World!</AcquirerMerchantID>
</Merchant>
</Merchants>
I have this in the schema:
<field name="AcquirerBID" type="string" indexed="true" stored="true" required="true" />
<field name="AcquirerName" type="string" indexed="true" stored="true" />
<field name="AcquirerMerchantID" type="string" indexed="true" stored="true"/>
I have this in the config:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" default="true" >
<lst name="defaults">
<str name="config">AMMFconfig.xml</str>
</lst>
</requestHandler>
The sample XML is not well formed. This might explain errors indexing the files:
$ xmllint sample.xml
sample.xml:13: parser error : expected '>'
</Merchants>
^
sample.xml:14: parser error : Premature end of data in tag Merchants line 3
sample.xml:14: parser error : Premature end of data in tag AMMF line 2
Corrected XML
Here's what I think your sample data should look like (I didn't check the XSD file):
<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
  <Merchants Count="153">
    <Merchant ChangeIndicator="A" LocationCountry="840">
      <AcquirerBID>10029881</AcquirerBID>
      <AcquirerName>WorldPay</AcquirerName>
      <AcquirerMerchantID>*</AcquirerMerchantID>
    </Merchant>
    <Merchant ChangeIndicator="A" LocationCountry="840">
      <AcquirerBID>10029882</AcquirerBID>
      <AcquirerName>WorldPay2</AcquirerName>
      <AcquirerMerchantID>Hello World!</AcquirerMerchantID>
    </Merchant>
  </Merchants>
</AMMF>
Alternative solution
I know you said you're not a programmer, but this task is significantly simpler if you use the SolrJ interface.
The following is a Groovy example which indexes your example XML:
//
// Dependencies
// ============
import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument
@Grapes([
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
])
//
// Main
// =====
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/");
def i = 1
new File(".").eachFileMatch(~/.*\.xml/) {
it.withReader { reader ->
def ammf = new XmlSlurper().parse(reader)
ammf.Merchants.Merchant.each { merchant ->
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", i++)
doc.addField("bid_s", merchant.AcquirerBID)
doc.addField("name_s", merchant.AcquirerName)
doc.addField("merchantId_s", merchant.AcquirerMerchantID)
server.add(doc)
}
}
}
server.commit()
Groovy is a Java scripting language that does not require compilation. It would be just as easy to maintain as a DIH config file.
To figure out how DIH XML import works, I suggest you first carefully read this chapter in DIH wiki: http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example.
Open the Slashdot link http://rss.slashdot.org/Slashdot/slashdot in your browser, then right click on the page and select View source. There's the XML file used in this example.
Compare it with XPathEntityProcessor configuration in DIH example and you'll see how easy it is to import any XML file in Solr.
If you need more help just ask...
Often the best thing to do is NOT use the DIH. How hard would it be to just post this data using the API and a custom script in a language you DO know?
The benefit of this approach is two-fold:
You learn more about your system, and know it better.
You don't spend time trying to understand the DIH.
The downside is that you're re-inventing the wheel a bit, but the DIH is quite a thing to understand.
