I am trying to index PDF documents with Tika.
I am using org.apache.solr.common.SolrInputDocument to add fields such as id, title, author and url to each document.
In the url field I store the path of the file being indexed.
Currently that is a path on my local system:
C:/Users/abcd/workspace/SOLRRichDocs/resources/apache-solr-ref-guide-5.1-001.pdf
But in the output this field comes back as plain text (not clickable).
I need the "url" field to come out as a hyperlink so that I can open the document.
This is the output I am getting:
<result name="response" numFound="1" start="0">
<doc>
<str name="id">24d0331c-7db8-42c0-ae57-f0a87b4cc798</str>
<str name="url">C:/Users/abcd/workspace/SOLRRichDocs/resources/apache-solr-ref-guide-5.1-001.pdf</str>
<long name="_version_">1507935693954875392</long></doc>
</result>
I need a hyperlink in the url field.
Could you experts share your input?
To achieve this you need to add code on the client side.
Solr only returns the plain text that has been stored.
Show the URL inside an HTML anchor (<a>) tag on the page that renders the results, for example a link that displays as:
Visit W3Schools.com!
Here you can put the text from the Solr response into the tag, and it will become a hyperlink for you.
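As a rough sketch of that client-side step, assuming the results page is built in Python: the helper below (to_link is a made-up name, not part of Solr or SolrJ) wraps the stored url value in an anchor tag. It assumes the value is a local path like the one in the question and prefixes it with file:///. Browsers often refuse to follow file: links from pages served over HTTP, so storing an HTTP URL instead may be more practical.

```python
from html import escape
from typing import Optional
from urllib.parse import quote


def to_link(path: str, label: Optional[str] = None) -> str:
    """Wrap a stored file path in an HTML anchor tag (hypothetical helper)."""
    # Percent-encode spaces and other unsafe characters, keeping "/" and the drive colon.
    href = "file:///" + quote(path, safe="/:")
    # Fall back to the file name as the visible link text.
    text = label if label is not None else path.rsplit("/", 1)[-1]
    return '<a href="{}">{}</a>'.format(escape(href, quote=True), escape(text))


# Value taken from the "url" field of the Solr response shown in the question.
stored = "C:/Users/abcd/workspace/SOLRRichDocs/resources/apache-solr-ref-guide-5.1-001.pdf"
print(to_link(stored))
```

The escaping matters because the stored value ends up inside HTML: a path containing & or a quote character would otherwise break the tag.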
This is a newbie question, but I am trying to understand the Solr search handler with defaults enabled.
As I understand it, df sets the default search field. What happens, and how does searching work, if I remove this part?
My current scenario: I am copying the data of all fields (e.g. title, department, etc.) into text, so a free-text search (using edismax) can find it, and that works as expected. But if new fields arrive tomorrow, say title, department and somefield, I have to reindex all the data before they become searchable. How can I avoid that? Can I have multiple fields in df?
Example: <str name="df">title, department, etc,etc </str>, like this?
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<!-- Default search field -->
<str name="df">text</str>
</lst>
</requestHandler>
Since you're using edismax, you can provide the qf argument with multiple fields instead, along with weights between them. Another option is to have one common field, e.g. text, and a copyField instruction that copies everything (source set to *) into it (dest set to text), then search against that field.
Neither approach requires a reindex when a new field appears: either the content is copied into the common field as documents are indexed, or the search simply considers more fields (with qf).
You can set the default value for qf in the same way as you've shown above, in the config file under defaults.
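To see what the qf option looks like on the request side, here is a minimal sketch that builds such a query URL in Python. The host, the core name mycore and the boost values are assumptions for illustration; only the field names come from the question.

```python
from urllib.parse import urlencode


def edismax_select_url(base_url: str, query: str, boosts: dict) -> str:
    """Build a /select URL that searches several fields via edismax."""
    # qf is a space-separated list of field^boost entries.
    qf = " ".join("{}^{}".format(field, boost) for field, boost in boosts.items())
    params = {"q": query, "defType": "edismax", "qf": qf}
    return "{}/select?{}".format(base_url, urlencode(params))


# Hypothetical core name and boosts; the field names come from the question.
url = edismax_select_url(
    "http://localhost:8983/solr/mycore",
    "free text",
    {"title": 2, "department": 1},
)
print(url)
```

Once the field list is settled, the same defType and qf values can be moved into the handler's defaults so that clients only send q.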
Solr has a module called Cell. It uses Tika to extract content from documents and index it with Solr.
From the sources at https://github.com/apache/lucene-solr/tree/master/solr/contrib/extraction , I conclude that Cell places the raw extracted document text into a field called "content". The field is indexed by Solr, but not stored. When you query for documents, "content" doesn't come up.
My Solr instance has no schema (I left the default schema in place).
I'm trying to implement similar behavior using the default UpdateRequestHandler (POST to /solr/corename/update). The POST request goes:
<add commitWithin="60000">
<doc>
<field name="content">lorem ipsum</field>
<field name="id">123456</field>
<field name="someotherfield_i">17</field>
</doc>
</add>
With documents added in this manner, the content field is indexed and stored. It's present in query results. I don't want it to be; it's a waste of space.
What am I missing about the way Cell adds documents?
If you don't want your field to store the contents, you have to set the field as stored="false".
Since you're using the schemaless mode (there still is a schema, it's just generated dynamically when new fields are added), you'll have to use the Schema API to change the field.
You can do this by issuing a replace-field command:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field":{
"name":"content",
"type":"text",
"stored":false }
}' http://localhost:8983/solr/collection/schema
You can see the defined fields by issuing a request against /collection/schema/fields.
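For anyone scripting this rather than using curl, a minimal Python sketch of the same Schema API call follows. It uses only the standard library and prepares the request without sending it. The URL and the type value text are copied from the curl command above; the type has to match a field type that actually exists in your schema.

```python
import json
from urllib.request import Request


def replace_field_request(schema_url: str, name: str, field_type: str) -> Request:
    """Prepare a Schema API call that redefines a field as not stored."""
    payload = {"replace-field": {"name": name, "type": field_type, "stored": False}}
    return Request(
        schema_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same URL and field definition as the curl command above.
req = replace_field_request("http://localhost:8983/solr/collection/schema", "content", "text")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```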
The Cell code does indeed add the content to the document as content, but there's a built-in field translation rule that replaces content with _text_. In schemaless Solr, _text_ is marked as not stored.
The rule is invoked by the following line in SolrContentHandler.addField():
String name = findMappedName(fname);
In the params object, there's a rule that fmap.content should be treated as _text_. It comes from corename\conf\solrconfig.xml, where by default there's the following fragment:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str> <!-- This one! -->
</lst>
</requestHandler>
Meanwhile, in corename\conf\managed_schema there's a line:
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
And that's the whole story.
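The effect of the fmap rule can be pictured with a few lines of Python. This is a simplified illustration of the lookup, not the actual Java code in SolrContentHandler; the function name and the dictionary standing in for the request parameters are assumptions.

```python
def find_mapped_name(params: dict, field_name: str) -> str:
    """Return the fmap.<name> target if one is configured, else the name itself."""
    return params.get("fmap." + field_name, field_name)


# Defaults copied from the /update/extract handler shown above.
defaults = {"lowernames": "true", "fmap.meta": "ignored_", "fmap.content": "_text_"}

print(find_mapped_name(defaults, "content"))
print(find_mapped_name(defaults, "title"))
```

A document posted straight to /update never goes through this mapping, which is why its content field keeps its name and gets whatever definition the schema gives it.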
I am new to the big data environment, so I apologize in advance if this question is meaningless.
I want to read Word / PDF documents and index them in Solr. I understand that Solr accepts JSON or XML, not Word / PDF / txt files. Is it necessary to convert a Word / PDF document into JSON or XML before sending it to Solr? I initially thought I should use Tika, but my understanding is that Tika can convert a PDF to text, not to JSON.
Could you please explain how to index these documents in Solr?
Thanks for the help
The standard endpoint for indexing 'rich files' is update/extract, so if you post your file to that destination, Solr will run it through Tika internally and extract the text and properties. You can provide literal values through the URL (such as an ID, filename or other metadata) with literal.fieldname=value arguments.
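A minimal Python sketch of such a request is shown here, prepared with the standard library and not sent. The host, the core name mycore, the literal values and the placeholder bytes are all assumptions for illustration; the file is passed as the raw request body with its own content type.

```python
from urllib.parse import urlencode
from urllib.request import Request


def extract_request(core_url: str, body: bytes, content_type: str, literals: dict) -> Request:
    """Prepare a POST to update/extract with literal.<field> values in the URL."""
    params = {"literal." + field: value for field, value in literals.items()}
    params["commit"] = "true"
    url = "{}/update/extract?{}".format(core_url, urlencode(params))
    return Request(url, data=body, headers={"Content-Type": content_type}, method="POST")


# Placeholder bytes stand in for the file; in practice read them with open(path, "rb").
req = extract_request(
    "http://localhost:8983/solr/mycore",
    b"%PDF-placeholder",
    "application/pdf",
    {"id": "doc1", "filename": "report.pdf"},
)
print(req.full_url)
# To actually send it: urllib.request.urlopen(req)
```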
The "Uploading Data with Solr Cell using Apache Tika" section of the manual gives you a low-level introduction to submitting documents with curl over HTTP, as well as the configuration options required to enable automagic extraction (which is enabled in a few of the examples (data driven, tech products iirc)):
If you are not working with the supplied sample_techproducts_configs or data_driven_schema_configs config set, you must configure your own solrconfig.xml to know about the JARs containing the ExtractingRequestHandler and its dependencies:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
You can then configure the ExtractingRequestHandler in solrconfig.xml.
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
<str name="uprefix">ignored_</str>
</lst>
<!--Optional. Specify a path to a tika configuration file. See the Tika docs for details.-->
<str name="tika.config">/my/path/to/tika.config</str>
<!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
for default date formats -->
<lst name="date.formats">
<str>yyyy-MM-dd</str>
</lst>
<!-- Optional. Specify an external file containing parser-specific properties.
This file is located in the same directory as solrconfig.xml by default.-->
<str name="parseContext.config">parseContext.xml</str>
</requestHandler>
I am trying to perform a search, and Solr does not return any results when I enter plain text with no field name. It works when I give the field name in the query browser, e.g. the query contact:Ajay returns the contact. But I need results when I search for just Ajay. Please help.
Check the default field defined in the initParams section of solrconfig.xml. You can set the df parameter to the field you want as the default. Here is the default configuration from Solr 5.2.0. You can use any field in place of text, which is the default for all the request handlers listed in the path.
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell">
<lst name="defaults">
<str name="df">text</str>
</lst>
</initParams>
If you want to search across all the fields of your schema, you need to create a new field, say search_field, and copy all the fields into it using copyField definitions. Then use search_field as the df in initParams.
<copyField source="field1" dest="search_field"/>
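With more than a handful of fields, the copyField lines can be generated rather than typed by hand. The sketch below does that with Python's standard XML library; field1 and search_field come from the answer, while field2 and field3 are made-up names.

```python
import xml.etree.ElementTree as ET


def copy_field_lines(sources, dest):
    """Return one <copyField> element per source field, as schema text."""
    lines = []
    for source in sources:
        element = ET.Element("copyField", {"source": source, "dest": dest})
        lines.append(ET.tostring(element, encoding="unicode"))
    return lines


# field1 and search_field come from the answer; field2 and field3 are made up.
for line in copy_field_lines(["field1", "field2", "field3"], "search_field"):
    print(line)
```

If every field should be searchable, a single definition with source set to * does the same job without listing the fields.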
I am trying to implement Apache Solr search through the SolrNet library. So far I have managed to run an instance of Solr on my machine and make some queries based on specific fields.
My code looks like this:
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Product>>();
var results = solr.Query(new SolrQueryByField("id", "SP2514N"));
This works fine now, but I would like to query without specifying a field, so that when I enter a search keyword Solr looks in all available fields and returns a result. I found the following code for this in the SolrNet library:
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Product>>();
var results = solr.Query(new SolrQuery("SP2514N"));
But this never worked. When I dug deeper, I found that I need to set default search fields in the Solr instance so that Solr searches those fields when no field is specified (this is how I understood it; I am not sure about this).
So I went to set default fields in Solr. I took solrconfig.xml and edited it like this:
<requestHandler name="/query" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
<str name="df">text</str>
<str name="df">id</str>
</lst>
</requestHandler>
[I just added the extra <str name="df">id</str> line.] But this did not help either, and I am stuck. Can anyone tell me how to set the default search field in Solr correctly? Or am I doing something else wrong?
I have uploaded my solrconfig file here
I do not know the SolrNet library, but to set a default search field you need to define defaultSearchField in schema.xml, i.e. <defaultSearchField>FieldName</defaultSearchField>.
You can find this file at <SOLR_HOME>\apache-solr-3.6.0\example\example-DIH\solr\testsyndrome\conf\schema.xml
I hope that's what you are looking for.
Don't start from SolrNet; use Solr's built-in web Admin interface. Iterate there until you understand the request handlers and their parameters. Then go back to SolrNet.
In your case, it seems that you changed the default request handler and tried to use the df parameter twice. I would stick to the original request handler for now, just to avoid the extra issue.
With the df parameter, are you trying to search a single field or multiple fields? If a single field, keep only one value for the parameter. If multiple, you need to switch to eDisMax, where you can provide a set of fields.
Again, the admin interface lets you experiment with this; then you can add it to the handler's default parameters.
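The single-field versus multiple-field choice can be written down as a small Python sketch. The function name is made up, and the field names are taken from the question; the resulting parameters are what you would type into the admin interface or later move into the handler's defaults.

```python
def search_params(query, fields):
    """Use df for a single default field, edismax with qf for several."""
    if not fields:
        raise ValueError("at least one field is required")
    if len(fields) == 1:
        return {"q": query, "df": fields[0]}
    return {"q": query, "defType": "edismax", "qf": " ".join(fields)}


print(search_params("SP2514N", ["id"]))
print(search_params("SP2514N", ["id", "text"]))
```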