My company uses an external provider (SLI) for its search needs. I have implemented Solr instead, as it is free and frankly superior. However, SLI provides a feature where, if you search for special keywords like "help" or "contact", the response from SLI does not include the regular content and instead contains only a few nodes similar to:
<response>
<merch><jumpurl>http://somedomain.com/somejumpurl</jumpurl></merch>
</response>
Any ideas how I can provide this feature with Solr?
What you are looking for is a type of "sponsored search".
Something similar can be achieved in Solr with the QueryElevationComponent.
You need to configure it in your solrconfig.xml, make a dedicated field to use it, and then create an external XML file with the special words and the rules you want applied to them, for example:
<elevate>
<query text="AAA">
<doc id="A" />
<doc id="B" />
</query>
<query text="ipod">
<doc id="A" />
<!-- you can optionally exclude documents from a query result -->
<doc id="B" exclude="true" />
</query>
</elevate>
And then use it in this way:
http://host/solr/elevate?q=YYYY&debugQuery=true&enableElevation=true
If you want to return only the results specified in the elevation file, add exclusive=true to the URL:
http://host/solr/elevate?q=YYYY&debugQuery=true&exclusive=true
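For completeness, here is a sketch of the matching solrconfig.xml sections, adapted from the stock Solr example configuration; the handler name and the queryFieldType value are just the common defaults and may need adjusting for your schema:
<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <!-- pick a fieldType to analyze queries with -->
  <str name="queryFieldType">string</str>
  <!-- the external file with your special words and rules -->
  <str name="config-file">elevate.xml</str>
</searchComponent>
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>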
I recently began learning Solr, and some things remain unclear to me. I will explain what I am trying to do; please tell me which way to go.
I need a web application in which it is possible to save data. Some of the fields will be plain text and some will be files. Adding text fields is straightforward, but I cannot work out how to add files, or their contents as text, and I do not know where to store the file itself.
If I need to find a file knowing only a couple of words from it, I want every file containing those words to appear in the results. Should I add a separate database for this? If so, where do I store the files? If not, the same question.
It would be very helpful to see this done in some example; maybe you have a link?
This is far too broad and non-specific to give an answer you can just implement; in general, you'd submit the documents together with an id to Solr (through Tika in the Extracting Request Handler / Solr Cell).
The documents themselves will have to be stored somewhere else, as Solr doesn't handle document storage for you. They can be stored on a cloud service, on a network drive or on a local disk; this will depend on your web application.
Your application will then receive the file from the user, store a database row assigning the file to the user, store the file somewhere (S3/Google Cloud Storage/local path) under a well-known name (usually the id of the row from the database) and submit the content to Solr for indexing, together with metadata (such as the user id) and the file id.
Searching will then give you the id back and you can retrieve the document from wherever you stored it.
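As a minimal sketch of the indexing part of that flow, assuming a core named mycore, a schema with a user_id field, and the standard Extracting Request Handler mounted at /update/extract (the database row and file storage steps are elided):
import requests

def index_file(file_path, file_id, user_id):
    # Send the stored file to Solr's Extracting Request Handler (Tika),
    # attaching the well-known id and the user metadata as literal fields.
    params = {
        "literal.id": file_id,       # usually the database row id
        "literal.user_id": user_id,  # assumes a user_id field exists in the schema
        "commit": "true",
    }
    with open(file_path, "rb") as f:
        resp = requests.post(
            "http://localhost:8983/solr/mycore/update/extract",
            params=params,
            files={"file": f},
        )
    resp.raise_for_status()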
MatsLindh has already outlined an approach to achieve what you are looking for.
Here are the steps by which you can index files whose location is known.
Update your solrconfig.xml with the lines below:
<!-- Load Data Import Handler and Apache Tika (extraction) libraries -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar"/>
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar"/>
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">tika-data-config.xml</str>
</lst>
</requestHandler>
Create a file named tika-data-config.xml under the G:\Solr\TikaConf\conf folder (this location may be different for you) with the configuration below:
<dataConfig>
<dataSource type="BinFileDataSource"/>
<document>
<entity name="file" processor="FileListEntityProcessor" dataSource="null"
baseDir="G:/Solr/solr-7.7.2/example/exampledocs" fileName=".*xml"
rootEntity="false">
<field column="file" name="id"/>
<entity name="pdf" processor="TikaEntityProcessor"
url="${file.fileAbsolutePath}" format="text">
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>
Add the field below to your schema.xml:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="false"/>
Update solrconfig.xml as below to disable schemaless mode:
<!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"
processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Go to the Solr admin page, select the core you created, and click on Data Import.
Once the data has been imported and indexed, you can verify it by querying.
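For example, a query along the lines of
http://localhost:8983/solr/yourcore/select?q=text:someword
(the core name is a placeholder) should return the files whose extracted content contains that word.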
If your file location is dynamic, i.e. you retrieve the file location from a database, then the database becomes your first entity, which fetches the files' metadata (id, name, author, the file path, etc.). In the second entity, the TikaEntityProcessor, you pass the file path and get the content of the file indexed.
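A rough sketch of what that two-entity configuration could look like (the JDBC settings, table and column names here are all hypothetical):
<dataConfig>
  <dataSource name="db" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/files" user="..." password="..."/>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <!-- first entity: file metadata from the database -->
    <entity name="meta" dataSource="db"
            query="select id, author, file_path from file_metadata">
      <field column="id" name="id"/>
      <field column="author" name="author"/>
      <!-- second entity: extract the content of each file via Tika -->
      <entity name="content" processor="TikaEntityProcessor"
              dataSource="bin" url="${meta.file_path}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>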
I am trying to implement delta-import in Solr indexing. It works fine when I am indexing data from a database, but I want to implement it on a file-based data source.
My data-config.xml file looks like this:
<dataSource type="com.solr.datasource.DataSource" name="SuggestionsFile"/>
<document name="suggester">
<entity name="file" dataSource="SuggestionsFile">
<field column="suggestion" name="suggestion" />
</entity>
</document>
I am using the DataImportHandler in my solrconfig.xml file. (I was not able to post my full config file; I tried, but for some reason it doesn't show up.)
My DataSource class reads the text file and returns a list of data that Solr indexes. It works fine for full-import but not for delta-import. Please suggest what else I need to do.
The FileListEntityProcessor supports filtering the file list based on the "newerThan" attribute:
<entity
name="fileimport"
processor="FileListEntityProcessor"
newerThan="${dataimporter.last_index_time}"
.. other options ..
>
...
</entity>
There's a complete example available online.
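The delta run itself is triggered through the DataImportHandler's delta-import command, for example:
http://host/solr/dataimport?command=delta-import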
Let's say I have two XML document types, A and B, that look like this:
A:
<xml>
<a>
<name>First Number</name>
<num>1</num>
</a>
<a>
<name>Second Number</name>
<num>2</num>
</a>
</xml>
B:
<xml>
<b>
<aKey>1</aKey>
<value>one</value>
</b>
<b>
<aKey>2</aKey>
<value>two</value>
</b>
</xml>
I'd like to index it like this:
<doc>
<str name="name">First Name</str>
<int name="num">1</int>
<str name="spoken">one</str>
</doc>
<doc>
<str name="name">Second Name</str>
<int name="num">2</int>
<str name="spoken">two</str>
</doc>
So, in effect, I'm trying to use a value from A as a key in B. Using DataImportHandler, I've used the following as my data config definition:
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="document" transformer="LogTransformer" logLevel="trace"
processor="FileListEntityProcessor" baseDir="/tmp/somedir"
fileName="A.*.xml$" recursive="false" rootEntity="false"
dataSource="null">
<entity name="a"
transformer="RegexTransformer,TemplateTransformer,LogTransformer"
logLevel="trace" processor="XPathEntityProcessor" url="${document.fileAbsolutePath}"
stream="true" rootEntity="true" forEach="/xml/a">
<field column="name" xpath="/xml/a/name" />
<field column="num" xpath="/xml/a/num" />
<entity name="b" transformer="LogTransformer"
processor="XPathEntityProcessor" url="/tmp/somedir/b.xml"
stream="false" forEach="/xml/b" logLevel="trace">
<field column="spoken" xpath="/xml/b/value[../aKey=${a.num}]" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
However, I encounter two problems:
I can't get the XPath expression with the predicate to match any rows, regardless of whether I use an alternative like /xml/b[aKey=${a.num}]/value or even a hardcoded value for aKey.
Even when I remove the predicate, the parser goes through the B file once for every row in A, which is obviously inefficient.
My question is: how, in light of the problems listed above, do I index the data correctly and efficiently with the DataImportHandler?
I'm using Solr 3.6.2.
Note: This is a bit similar to this question, but it deals with two XML document types instead of a RDBMS and an XML document.
I have had very bad experiences using the DataImportHandler for that kind of data. A simple Python script to merge your data would probably be smaller than your current configuration and much more readable. Depending on your requirements and data size, you could create a temporary XML file or pipe the results directly to Solr. If you really have to use the DataImportHandler, you could use a URLDataSource and set up a minimal server which generates your XML. Obviously I'm a Python fan, but it's quite likely that it would also be an easy job in Ruby, Perl, ...
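For illustration, a minimal sketch of such a merge script; the input file names and the update URL are assumptions, and error handling is omitted:
import xml.etree.ElementTree as ET
import requests

# Build a lookup table from B: aKey -> value
spoken = {
    b.findtext("aKey"): b.findtext("value")
    for b in ET.parse("/tmp/somedir/b.xml").getroot().iter("b")
}

# Join each <a> element in A against the lookup and build Solr <doc> elements
add = ET.Element("add")
for a in ET.parse("/tmp/somedir/A.1.xml").getroot().iter("a"):
    doc = ET.SubElement(add, "doc")
    for name, value in (("name", a.findtext("name")),
                        ("num", a.findtext("num")),
                        ("spoken", spoken.get(a.findtext("num")))):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value

# Post the merged documents to Solr's XML update handler and commit
requests.post("http://localhost:8983/solr/update?commit=true",
              data=ET.tostring(add),
              headers={"Content-Type": "text/xml"})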
I finally went with another solution, due to an additional design requirement I didn't originally mention. What follows is the explanation and discussion.
If you only have one or a couple of import flow types for your Solr instances:
Then it might be best to go with Achim's answer and develop your own importer, either, as Achim suggests, in your favorite scripting language, or in Java, using SolrJ's ConcurrentUpdateSolrServer.
This is because the DataImportHandler framework does have a sudden spike in its learning curve once you need to define more complex import flows.
If you have a nontrivial number of different import flows:
Then I would suggest you consider staying with the DataImportHandler since you will probably end up implementing something similar anyway. And, as the framework is quite modular and extendable, customization isn't a problem.
This is the additional requirement I mentioned, so in the end I went with that route.
I solved my particular quandary by indexing the files I needed to reference into separate cores and using a modified SolrEntityProcessor to access that data. The modifications were as follows:
applying the patch for the sub-entity problem,
adding caching (a quick solution using Guava; there's probably a better way using an available Solr API for accessing other cores locally, but I was in a bit of a hurry at that point).
If you don't want to create a new core for each file, an alternative would be an extension of Achim's idea, i.e. creating a custom EntityProcessor that would preload the data and enable querying it somehow.
I have some XML to ingest into Solr, which sounds like a use case that is intended to be solved by the DataImportHandler. What I want to do is pull the column name from one XML attribute and the value from another attribute. Here is an example of what I mean:
<document>
<data ref="reference.foo">
<value>bar</value>
</data>
</document>
From this XML snippet, I want to add a field with name reference.foo and value bar. The DataImportHandler includes an XPathEntityProcessor for processing XML documents. I've tried using it, and it works perfectly if I give it a known column name (e.g., <field column="ref" xpath="/document/data/@ref">), but I have not been able to find any documentation or examples suggesting either how to do what I want, or that it cannot be done. So:
Can I do this using XPathEntityProcessor? If so, how?
If not, can I do this some other way with DataImportHandler?
Or am I left with writing my own import handler?
I haven't managed to find a way to do this without bringing in a transformer, but with a simple ScriptTransformer I worked it out. It goes something like this:
...
<script>
function makePair(row) {
var theKey = row.get("theKey");
var theValue = row.get("theValue");
row.put(theKey, theValue);
row.remove("theKey");
row.remove("theValue");
return row;
}
</script>
...
<entity name="..."
processor="XPathEntityProcessor"
transformer="script:makePair"
forEach="/document"
...>
<field column="theKey" xpath="/document/data/#ref" />
<field column="theValue" xpath="/document/data/value" />
</entity>
...
Hope that helps someone!
Note, if your dynamicField is multivalued, you have to iterate over theKey since row.get("theKey") will be a list.
What you want to do is select the node keying on an attribute value.
From your example, you'd do this:
<field column="ref" xpath="/document/data[#ref='reference.foo']"/>
I have the following field defined in Solr (schema.xml):
<field name="store" type="location" indexed="true" stored="true"/>
If I search for, say, this:
&fq={!geofilt pt=45.15,-93.85 sfield=store d=5}
Then I can see the location coordinates in the search result.
But the field "store" seems to be a hidden field under normal circumstances. How do I get the coordinates to be a part of the search result for normal searches? (q=*:* for example)
I just verified that this works correctly for both Solr 3.1 and Solr 4.0-dev with the example data.
Example:
http://localhost:8983/solr/select?q=*:*&fl=id,store&wt=json&indent=true
[...]
"response":{"numFound":17,"start":0,"docs":[
{
"id":"SP2514N",
"store":"35.0752,-97.032"},
{
"id":"6H500F0",
"store":"45.17614,-93.87341"},
{
"id":"F8V7067-APL-KIT",
"store":"45.18014,-93.87741"},
[...]
Did you perhaps change this setting and forget to re-index or forget to commit?
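If a commit is the issue, you can issue one explicitly, e.g.:
http://localhost:8983/solr/update?commit=true
and re-run the query; the stored coordinates should then show up.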