How to index a PDF / Word doc in Apache Solr

I am new to the big data environment, so apologies in advance if the question below is meaningless.
I want to read Word / PDF documents and index them in Solr. I understand that Solr accepts JSON or XML, not Word / PDF / txt files. Is it necessary to convert a Word / PDF document into JSON or XML before sending it to Solr? I initially thought I should use Tika, but my understanding is that Tika can convert a PDF to text, not to JSON.
Could you please guide me on how to index these documents in Solr?
Thanks for the help

The standard endpoint for indexing 'rich files' is update/extract, so if you post your file to that destination, Solr will run it through Tika internally and extract the text and properties. You can provide literal values through the URL (such as an ID, filename, or other metadata) with literal.fieldname=value arguments.
The "Uploading Data with Solr Cell using Apache Tika" description in the manual gives you a low-level introduction to how to submit documents with curl over HTTP, as well as which configuration options are required to enable automagic extraction (which is enabled in a few of the examples (data driven, tech products, iirc)):
If you are not working with the supplied sample_techproducts_configs or data_driven_schema_configs config set, you must configure your own solrconfig.xml to know about the JARs containing the ExtractingRequestHandler and its dependencies:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
You can then configure the ExtractingRequestHandler in solrconfig.xml.
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
<str name="uprefix">ignored_</str>
</lst>
<!--Optional. Specify a path to a tika configuration file. See the Tika docs for details.-->
<str name="tika.config">/my/path/to/tika.config</str>
<!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
for default date formats -->
<lst name="date.formats">
<str>yyyy-MM-dd</str>
</lst>
<!-- Optional. Specify an external file containing parser-specific properties.
This file is located in the same directory as solrconfig.xml by default.-->
<str name="parseContext.config">parseContext.xml</str>
</requestHandler>
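As a rough illustration of the request described in the answer above, here is a minimal Python sketch that builds an update/extract URL with literal.* parameters and posts the raw file to it. The host, the core name mycore, the field names and the helper names extract_url and index_file are assumptions made for this example; they are not taken from the question.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def extract_url(base_url, core, literals, commit=True):
    """Build the /update/extract URL, passing metadata as literal.<field> parameters."""
    params = {f"literal.{name}": value for name, value in literals.items()}
    if commit:
        params["commit"] = "true"
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


def index_file(path, url, content_type="application/pdf"):
    """POST the raw file bytes; Solr runs them through Tika on the server side."""
    with open(path, "rb") as fh:
        request = Request(url, data=fh.read(), headers={"Content-Type": content_type})
    with urlopen(request) as response:
        return response.status


# Hypothetical host, core and field values, for illustration only.
url = extract_url(
    "http://localhost:8983/solr",
    "mycore",
    {"id": "doc1", "filename": "report.pdf"},
)
print(url)
```

Calling index_file("report.pdf", url) would then send the document to a running Solr instance; no conversion to JSON or XML is needed on the client.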

Related

Understanding solr(8.x) default searchHandler

This is a newbie question, but I am trying to understand the Solr search handler with the defaults enabled.
As I understand it, df sets the default search field. What happens, and how does the handler work, if I remove this part?
My current scenario: I am copying the data from all fields (e.g. title, department, etc.) into text, so a free-text search (using edismax) can find it, and this works as expected. But when new fields arrive later, say title, department and somefield, I have to reindex all the data before the new field becomes searchable. How can I avoid that? Can I have multiple fields in df?
For example, something like <str name="df">title, department, etc,etc </str>?
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<!-- Default search field -->
<str name="df">text</str>
</lst>
</requestHandler>
Since you're using edismax, you can provide the qf argument with multiple fields instead, along with weights between them. Another option is to have one common field, i.e. something like text, and a copyField instruction that copies everything (source set to *) to text (dest set to text), then search against that field.
Neither option requires a reindex when a new field appears: with copyField the content is copied from every document as it is being indexed, and with qf the search simply considers more fields.
You can set the default value for qf in the same way as you've shown above, in the config file under defaults.
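To make the qf suggestion concrete, the following Python sketch assembles the request parameters for an edismax query across several weighted fields. The field names, the weights and the helper name edismax_params are illustrative assumptions; the same values could instead be placed under defaults in the handler configuration.

```python
from urllib.parse import urlencode


def edismax_params(text, field_weights, rows=10):
    """Return query parameters that search several fields with per-field boosts."""
    # qf is a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{weight}" for field, weight in field_weights.items())
    return {"q": text, "defType": "edismax", "qf": qf, "rows": rows}


# Hypothetical fields and weights: title matches count twice as much.
params = edismax_params("annual report", {"title": 2, "department": 1})
query_string = urlencode(params)
print(query_string)
```

Adding a new field to the search later only means adding one more entry to the field_weights mapping (or to qf in the config).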

Need URL (hyperlink) in one of the Solr fields

I am trying to index PDF documents with Tika.
I am using org.apache.solr.common.SolrInputDocument() to add different fields to the documents, such as id, title, author and url.
In the url field I store the path of the file that is being indexed.
Currently this is a path on my local system:
C:/Users/abcd/workspace/SOLRRichDocs/resources/apache-solr-ref-guide-5.1-001.pdf
But in the output this field comes back as plain text (not clickable).
I need the "url" field to be output as a hyperlink so that I can open the document.
This is the output I am getting:
<result name="response" numFound="1" start="0">
<doc>
<str name="id">24d0331c-7db8-42c0-ae57-f0a87b4cc798</str>
<str name="url">C:/Users/abcd/workspace/SOLRRichDocs/resources/apache-solr-ref-guide-5.1-001.pdf</str>
<long name="_version_">1507935693954875392</long></doc>
</result>
I need a hyperlink in the url field.
Could the experts here share their input?
To achieve this, you need to add code on the client side.
Solr only returns the plain text that has been stored.
Wrap the URL in an HTML anchor tag when you render it.
For example:
<a href="https://www.w3schools.com">Visit W3Schools.com!</a>
Here you can substitute the text from the Solr response, and it will be rendered as a hyperlink.
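A client-side version of this idea might look like the Python sketch below. It takes a document as a plain dict (as it might look after parsing the Solr response), converts the stored Windows path to a file: URI and wraps it in an anchor tag. The helper name link_for and the choice to use the file name as the link text are assumptions made for illustration.

```python
from html import escape
from pathlib import PureWindowsPath


def link_for(doc):
    """Turn the stored path in doc['url'] into an HTML anchor element."""
    path = PureWindowsPath(doc["url"])
    href = path.as_uri()   # converts the absolute Windows path to a file: URI
    label = path.name      # show just the file name as the link text
    return f'<a href="{escape(href, quote=True)}">{escape(label)}</a>'


# Document shaped like the response shown in the question.
doc = {
    "id": "24d0331c-7db8-42c0-ae57-f0a87b4cc798",
    "url": "C:/Users/abcd/workspace/SOLRRichDocs/resources/apache-solr-ref-guide-5.1-001.pdf",
}
anchor = link_for(doc)
print(anchor)
```

Solr itself still stores and returns only the plain string; the markup is produced entirely by the code that renders the results.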

Solr : Make XML as response in Solr 4.8.1

I am using solr 4.8.1.
When I execute any query for testing purposes from the dashboard, I get the response in JSON (by default).
Can I change this and make XML the default?
Please refer to the screenshot below.
I am talking about the dashboard only.
Thanks for looking here.... :)
The default values for your requestHandlers (which are what respond when a query is sent to /query or /select etc.) are set in solrconfig.xml. Here's the example from example/solr in the distribution:
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
<str name="df">text</str>
</lst>
</requestHandler>
Changing wt to xml will give you a requestHandler that returns its response as XML by default, unless overridden at query time with the wt parameter. There might be parts of the web interface that assume the response will be JSON, but I'm pretty sure those supply a value for wt anyway.
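The query-time override mentioned above can be sketched in Python as follows. The host, the core name collection1 and the helper name select_url are assumptions for this example; the point is only that an explicit wt parameter in the request takes precedence over the handler's default.

```python
from urllib.parse import urlencode


def select_url(base_url, core, query, wt="xml"):
    """Build a /select URL that asks for a specific response writer via wt."""
    params = {"q": query, "wt": wt, "indent": "true"}
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host and core name, for illustration only.
url = select_url("http://localhost:8983/solr", "collection1", "*:*")
print(url)
```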
I don't know whether the web UI defaults can be configured, but you can easily change the HTML.
In
solr-4.8.1\example\solr-webapp\webapp\tpl\query.html
change the order of the options:
<select name="wt" id="wt" title="The writer type (response format).">
<option>xml</option>
<option>json</option>
<option>python</option>
<option>ruby</option>
<option>php</option>
<option>csv</option>
</select>
Whichever option you put first will be the default. Alternatively, mark one as selected:
<option selected="selected">
You may also change this HTML in the WAR file in solr-4.8.1\example\webapps.
Note that the paths are relative to the example directory of the 4.8.1 release.

Set default search fields in Apache Solr

I am trying to implement Apache Solr search through the SolrNet library. So far I have managed to run an instance of Solr on my machine and make some queries based on specific fields.
My code looks like this:
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Product>>();
var results = solr.Query(new SolrQueryByField("id", "SP2514N"));
This works fine, but I would like to make queries without specifying a field, so that when I enter a search keyword Solr looks in all the available fields and returns a result. I found the code to do this with the SolrNet library here:
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Product>>();
var results = solr.Query(new SolrQuery("SP2514N"));
But this never worked. When I dug into it, I found that I need to set default search fields in the Solr instance so that Solr searches those fields when nothing else is specified (this is how I understood it; I am not sure about this).
So I tried to set the default fields in Solr. I took solrconfig.xml and edited it like this:
<requestHandler name="/query" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
<str name="df">text</str>
<str name="df">id</str>
</lst>
</requestHandler>
[I just added <str name="df">id</str> as an extra field.] But this did not help either, and I am stuck. Can anyone tell me how to set the default search field in Solr correctly? Or am I doing something else wrong?
I have uploaded my solrconfig file here.
I do not know the SolrNet library, but to set a default search field you can define defaultSearchField in schema.xml, i.e. <defaultSearchField>FieldName</defaultSearchField>. Note that this element was deprecated in favour of the df request parameter and removed in Solr 7.
You can find this file at <SOLR_HOME>\apache-solr-3.6.0\example\example-DIH\solr\testsyndrome\conf\schema.xml
I hope that's what you are looking for.
Don't start from SolrNet, use Solr's built-in Web Admin interface. Iterate there until you understand the request handlers and the parameters. Then, go back to SolrNet.
In your case, it seems that you changed the default request handler and tried to use the df parameter twice. I would stick to the original request handler for now, just to avoid the extra issue.
With the df parameter, are you trying to search a single field or multiple fields? If a single field, keep only one value for the parameter. If multiple, you need to switch to eDisMax, where you can provide a set of fields.
Again, the admin interface lets you experiment with this; then you can add the result to the handler's default parameters.

Can SOLR configuration files be located in parent folders?

I have configured the QueryElevation searchComponent of SOLR as documented here:
http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field
However, I would like to load the elevate.xml file from several folders above the default one.
I cannot get this to work... all of the following generate an error:
<str name="config-file">../../elevate.xml</str>
<str name="config-file">..\..\elevate.xml</str>
<str name="config-file">c:/elevate.xml</str>
<str name="config-file">c:\elevate.xml</str>
Per the Solr wiki:
Path to the file that defines query elevation. This file must exist in:
${instanceDir}/conf/${config-file} , or
${dataDir}/${config-file}

Resources