Solr has a module called Cell. It uses Tika to extract content from documents and index it with Solr.
From the sources at https://github.com/apache/lucene-solr/tree/master/solr/contrib/extraction , I conclude that Cell places the raw extracted document text into a field called "content". The field is indexed by Solr, but not stored; when you query for documents, "content" doesn't come up.
My Solr instance has no custom schema (I left the default schemaless configuration in place).
I'm trying to implement similar behavior using the default UpdateRequestHandler (POST to /solr/corename/update). The POST request body looks like this:
<add commitWithin="60000">
<doc>
<field name="content">lorem ipsum</field>
<field name="id">123456</field>
<field name="someotherfield_i">17</field>
</doc>
</add>
With documents added in this manner, the content field is indexed and stored. It's present in query results. I don't want it to be; it's a waste of space.
What am I missing about the way Cell adds documents?
If you don't want your field to store the contents, you have to set the field as stored="false".
Since you're using the schemaless mode (there still is a schema, it's just generated dynamically when new fields are added), you'll have to use the Schema API to change the field.
You can do this by issuing a replace-field command:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field":{
"name":"content",
"type":"text",
"stored":false }
}' http://localhost:8983/solr/collection/schema
You can see the defined fields by issuing a request against /collection/schema/fields.
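For example, assuming the collection is named collection as in the command above:
curl http://localhost:8983/solr/collection/schema/fields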
The Cell code indeed adds the content to the document as content, but there's a built-in field mapping rule that renames content to _text_. In the schemaless Solr configuration, _text_ is defined as indexed but not stored.
The rule is invoked by the following line in the SolrContentHandler.addField():
String name = findMappedName(fname);
In the params object, there's a rule (fmap.content) that maps content to _text_. It comes from corename\conf\solrconfig.xml, where by default there's the following fragment:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str> <!-- This one! -->
</lst>
</requestHandler>
Meanwhile, in corename\conf\managed_schema there's a line:
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
And that's the whole story.
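Incidentally, that mapping can be overridden per request with the fmap.content parameter on /update/extract. A hedged sketch (corename, the content_txt field name and the file path are assumptions; content_txt would have to exist in your schema with whatever stored setting you want):
curl "http://localhost:8983/solr/corename/update/extract?literal.id=123456&fmap.content=content_txt&commit=true" -F "myfile=@/path/to/document.pdf"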
Related
This is a newbie question, but I'm trying to understand the Solr search handler with defaults enabled.
As I understand it, df is the default search field. What will happen / how does it work if I remove this part?
My current scenario: I am pushing/adding all field data (e.g. title, department, etc.) into text, so when I run a free-text search (using edismax) it is searchable, and it works as expected. But when new fields arrive tomorrow, say title, department, somefield, I have to reindex all the data again before they become searchable. How can I achieve this without reindexing? Can I have multiple fields in df?
example : <str name="df">title, department, etc,etc </str> like this ?
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<!-- Default search field -->
<str name="df">text</str>
</lst>
Since you're using edismax, you can provide the qf argument with multiple fields instead - and weights between them. Another option is to have one common field, i.e. something like text and a copyField instruction that copies everything (source set to *) to text (dest set to text), then search against that field.
Using either won't require a reindex, since the content will either be copied into the common field as documents are being indexed, or the search will simply consider more fields (with qf).
You can set the default value for qf in the same way as you've shown above in the config file, under defaults.
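To illustrate both options, a rough sketch (the field names and the qf weights are assumptions based on your example):
<!-- Option 1: default qf for edismax in the /select handler's defaults -->
<str name="defType">edismax</str>
<str name="qf">title^2 department somefield</str>

<!-- Option 2: one catch-all field in the schema, populated by copyField -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="text"/>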
I recently began to learn Solr. Some things remain incomprehensible to me, so I will explain what I'm trying to do; please tell me which way to go.
I need a web application in which it will be possible to save data, where some fields will be plain text and some will be files. Adding fields as text is understandable, but I can't work out how to add files, or their contents as text, and in that case I don't know where to store the file itself.
If I need to find a file and only a couple of words from the entire file are known, I want all the files containing those words to appear. Should I add a separate database in this case? If so, where do I store the files? If not, the same question.
I would be very pleased to see an understandable example of this; maybe you have a link?
This is far too broad and non-specific to give an answer you can just implement; in general you'd submit the documents together with an id to Solr (through Tika in the Extracting Request Handler / Solr Cell).
The documents themselves will have to be stored somewhere else, as Solr doesn't handle document storage for you. They can be stored on a cloud service, on a network drive or a local disk - this will depend on your web application.
Your application will then receive the file from the user, store a database row assigning the file to the user, store the file somewhere (S3/GoogleCloudStorage/Local path) under a well-known name (usually the id of the row from the database) and submit the content to Solr for indexing - together with metadata (such as the user id) and the file id.
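As a rough sketch of that last step (the core name, field names and the file path here are assumptions), submitting the file to the Extracting Request Handler together with the database id and a user id could look like:
curl "http://localhost:8983/solr/corename/update/extract?literal.id=42&literal.user_id_i=7&commit=true" -F "file=@/var/uploads/42.pdf"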
Searching will then give you the id back and you can retrieve the document from wherever you stored it.
MatsLindh has already mentioned an approach to achieve what you are looking for.
Here are some steps by which you can index files from a known location.
Update solrconfig.xml with the lines below:
<!-- Load Data Import Handler and Apache Tika (extraction) libraries -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar"/>
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar"/>
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">tika-data-config.xml</str>
</lst>
</requestHandler>
Create a file named tika-data-config.xml under the G:\Solr\TikaConf\conf folder with the configuration below. This location could be different for you.
<dataConfig>
<dataSource type="BinFileDataSource"/>
<document>
<entity name="file" processor="FileListEntityProcessor" dataSource="null"
baseDir="G:/Solr/solr-7.7.2/example/exampledocs" fileName=".*xml"
rootEntity="false">
<field column="file" name="id"/>
<entity name="pdf" processor="TikaEntityProcessor"
url="${file.fileAbsolutePath}" format="text">
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>
Add the field below to your schema.xml:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="false"/>
Update the solrconfig.xml file as below in order to disable schemaless mode:
<!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"
processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Go to the Solr admin page, select the core you created and click on Dataimport.
Once data is imported or indexed, you can verify the same by querying it.
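If you prefer the command line over the admin UI, a hedged sketch of the same two steps (the core name is an assumption):
# trigger a full import through the DataImportHandler
curl "http://localhost:8983/solr/corename/dataimport?command=full-import"
# verify that documents were indexed
curl "http://localhost:8983/solr/corename/select?q=*:*&rows=5"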
If your file location is dynamic, meaning you retrieve the file location from a database, then the database query would be your first entity, retrieving the files' metadata such as id, name, author and the file path. In the second entity, the TikaEntityProcessor, you pass the file path and get the content of the file indexed. A sketch of that setup follows.
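A rough sketch of such a two-entity configuration, assuming a JDBC source (the driver, connection details, table and column names are all assumptions):
<dataConfig>
  <dataSource name="db" type="JdbcDataSource" driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/docs" user="solr" password="secret"/>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <!-- first entity: file metadata from the database -->
    <entity name="meta" dataSource="db"
            query="SELECT id, title, author, file_path FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="author" name="author"/>
      <!-- second entity: extract the file content with Tika, using the path from the first entity -->
      <entity name="file" processor="TikaEntityProcessor" dataSource="bin"
              url="${meta.file_path}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>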
Alright, I am looking for general guidelines on how to import a CSV file containing the following fields
poi_name, latitude, longitude
into a Solr (7.x) core to perform geo queries. What is the right way to achieve this? I tried:
using bin/post - the import creates a useless schema where all the fields are multivalued, and obviously no location field is created.
doing the same but creating a schema for the 3 fields via the admin UI - then I get "Document is missing mandatory uniqueKey field: id". I would like the id to be populated automatically with a random UUID.
and lastly, most importantly, how do I "compute" a LatLonPointSpatialField from the latitude and longitude? Via the UI there was no way to create a 4th field that uses other fields.
Do I really need to go through the trouble of defining a DataImportHandler to do this, or is it sufficient to create a schema for all this?
What if the latitude and longitude are already there and I am trying to update the schema with the location field at a later time?
I can't find a good example for doing this. However, there is an old example where the location field is automatically composed if latitude and longitude have predefined names with a suffix, something like location_1_coordinate and location_2_coordinate - this seems silly!
Just to conclude and aggregate the answer for anyone interested, this is the solution I came to following MatsLindh's suggestion. Context: CentOS 7 and Solr 7.5.
Sample.csv content
name,lon,lat
A,22.9308852,39.3724824
B,22.5094530,40.2725792
relevant portion of the schema (managed-schema file)
<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>
...
<field name="lat" type="string" omitTermFreqAndPositions="true" indexed="true" required="true" stored="true"/>
<field name="location" type="location" multiValued="false" stored="true"/>
<field name="lon" type="string" omitTermFreqAndPositions="true" indexed="true" stored="true"/>
solrconfig.xml
<updateRequestProcessorChain name="uuid-location">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">lat</str>
<str name="dest">location</str>
</processor>
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">lon</str>
<str name="dest">location</str>
</processor>
<processor class="solr.ConcatFieldUpdateProcessorFactory">
<str name="fieldName">location</str>
<str name="delimiter">,</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
<lst name="defaults">
<str name="df">_text_</str>
<str name="update.chain">uuid-location</str>
</lst>
</initParams>
and to import the sample file into the core run the following in bash
/opt/solr/bin/post -c your_core_name /opt/solr/sample.csv
And if you wonder how to query that data, use
http://localhost:8983/solr/your_core_name/select?q=*:*&fq={!geofilt%20sfield=location}&pt=42.27,-74.91&d=1
where pt is the lat,lon point and d is the distance in kilometers.
First - you'll have to define a location field. The schemaless mode is made for quick prototyping; if you need more specific fields (and want to be sure that the fields get the correct type in production), you'll have to configure them explicitly. Use the LatLonPointSpatialField type for this, and make it single valued.
First define the field type to use (these are adapted from the Schema API documentation):
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type":{
    "name":"location_type",
    "class":"solr.LatLonPointSpatialField"
  }
}' http://localhost:8983/solr/gettingstarted/schema
Then add a field with that type:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"location",
"type":"location_type",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema
The two other issues can be fixed through a custom update chain (you provide the name of the chain as the update.chain URL parameter when indexing the document).
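For example, assuming the chain below is registered as populate-location and the core is called your_core_name, the CSV could be posted like this:
curl "http://localhost:8983/solr/your_core_name/update?update.chain=populate-location&commit=true" -H "Content-Type: text/csv" --data-binary @sample.csv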
To automagically assign a guid to any indexed document, you can use the UUIDUpdateProcessorFactory. Give the field name (id) as the fieldName parameter.
To get the latitude and longitude concatenated to a single field with , as the separator, you can use a ConcatFieldUpdateProcessorFactory. The important thing here is that it concatenates a list of values given for a single valued field into a single value - it does not concatenate two different field names. To fix that we can use a CloneFieldUpdateProcessor to move both the latitude and longitude value into a separate field.
<updateRequestProcessorChain name="populate-location">
<processor class="solr.CloneFieldUpdateProcessorFactory">
<arr name="source">
<str>latitude</str>
<str>longitude</str>
</arr>
<str name="dest">location</str>
</processor>
<processor class="solr.ConcatFieldUpdateProcessorFactory">
<str name="delimiter">,</str>
</processor>
</updateRequestProcessorChain
If you add the location field later and already have the data in your database, this won't work. Solr won't touch data that has already been indexed, and you'll have to reindex to get your information processed and indexed the correct way. This is true regardless of how you get content into the location field.
The old example is probably the other way around - earlier you'd send a latlon pair, and it'd get indexed as two separate values - one for latitude and one for longitude - under the hood. You could probably hack around that by sending a single value for each, but it was really meant to work the other way around - sending one value and getting it indexed as two separate fields. Since the geospatial support in Lucene (and Solr) was just starting out, the already existing types were re-used instead of creating more dedicated types.
I am trying to perform a search and Solr does not return any results when I search with a plain text query. It works when I mention the field name in the query, e.g. q=contact:Ajay returns the contact, but I need it to work with only Ajay as the search term. Please help.
Check the default field defined in the initParams section of solrconfig.xml. You can update the df parameter to the field you want as the default field. Here is the default configuration from Solr 5.2.0. You can use any field in place of text, which is the default for all the requestHandlers listed in the path.
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell">
<lst name="defaults">
<str name="df">text</str>
</lst>
</initParams>
If you want to provide search across all the fields of your schema, you need to create a new field, say search_field, then copy all the fields into it using copyField definitions. Use search_field as the df in the initParams.
<copyField source="field1" dest="search_field"/>
I am trying to index PDF documents with Tika.
I am using org.apache.solr.common.SolrInputDocument() to add different fields to documents like id, title, author, url.
In the url field I am giving the path of the file which is going to be indexed.
Currently my local system path:
C:/Users/abcd/workspace/SOLRRichDocs/resources/apache-solr-ref-guide-5.1-001.pdf
But in the output this field simply comes back as plain text (not clickable).
My requirement is to output the "url" field as a hyperlink so that I can go to the document.
This is the output I am getting:
<result name="response" numFound="1" start="0">
<doc>
<str name="id">24d0331c-7db8-42c0-ae57-f0a87b4cc798</str>
<str name="url">C:/Users/abcd/workspace/SOLRRichDocs/resources/apache-solr-ref-guide-5.1-001.pdf</str>
<long name="_version_">1507935693954875392</long></doc>
</result>
I need a hyperlink in the url field.
Could you experts share your input?
To achieve that you need to add code at the client end.
Solr will give you the plain text that has been stored.
For example, wrap the URL in an HTML anchor tag:
<a href="https://www.w3schools.com">Visit W3Schools.com!</a>
Here you can put the text from the Solr response, and it will be a hyperlink for you.