Storing and searching data pairs in Solr


I'm trying to figure out how to store and search data pairs. I have a document similar to that below and my goal is to perform a search that returns all documents with a given specialty and then sort the results by the matching specialty ability:
<doc>
<id>123</id>
<firstName>Joe</firstName>
<lastName>Bloggs</lastName>
<specialties>
<specialty>
<type>Foo</type>
<ability>1</ability>
</specialty>
<specialty>
<type>Bar</type>
<ability>2</ability>
</specialty>
<specialty>
<type>Baz</type>
<ability>2</ability>
</specialty>
</specialties>
</doc>
I'm familiar with indexing, searching and faceting simple documents, but I am struggling to even find a starting point with this =(
Should I simply use two collections and join?

If the number of specialties is finite and known beforehand, you could try the following.
Instead of having two fields storing specialty and ability, have ONE field per specialty that holds the ability for that specialty ("ability_of_a_specialty").
For example,
<specialties>
<Foo_ability> 1 </Foo_ability>
<Bar_ability> 2 </Bar_ability>
<Dummy_ability> 0 </Dummy_ability>
...
</specialties>
Now, it should be straightforward to transform the above attributes to a Lucene doc.
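A minimal sketch of that transformation, assuming Python on the indexing side. The helper names `flatten_specialties` and `specialty_query` are hypothetical, and the query parameters assume that each `<type>_ability` field is declared as a single-valued numeric field in the schema and that a higher ability should sort first.

```python
def flatten_specialties(doc):
    """Turn nested (type, ability) pairs into one flat field per specialty."""
    flat = {k: v for k, v in doc.items() if k != "specialties"}
    for spec in doc.get("specialties", []):
        flat[f"{spec['type']}_ability"] = spec["ability"]
    return flat


def specialty_query(specialty):
    """Match documents that have a value for the specialty, highest ability first."""
    field = f"{specialty}_ability"
    return {"q": f"{field}:[* TO *]", "sort": f"{field} desc"}


doc = {
    "id": "123",
    "firstName": "Joe",
    "lastName": "Bloggs",
    "specialties": [
        {"type": "Foo", "ability": 1},
        {"type": "Bar", "ability": 2},
        {"type": "Baz", "ability": 2},
    ],
}
flat_doc = flatten_specialties(doc)
params = specialty_query("Bar")
```

If absent specialties are stored as 0, as with Dummy_ability above, the range would need to start at 1 instead of `*`.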

Related

How to extract BM25 score from Solr

For a document auto-tagging system, we would like to apply tags based on Solr's BM25 measure.
Our algorithm should perform like this:
indexed documents with their tags are already stored in Solr
new documents without tags are posted => apply tags based on the nearest neighbor of the new document (afaik the document with the best BM25 score)
So my questions:
Is this feasible? Can I extract the BM25 score out of Solr? This could require first indexing the new document, getting its nearest neighbor and that neighbor's tags, then deleting the new doc and re-indexing it with the tags from the nearest neighbor.
Is this a good idea in general?
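One possible reading of this workflow, sketched in Python under stated assumptions: the new document's text is sent as a query with `score` requested in the `fl` parameter, and the tags of the top-scoring hit are copied. The field name `tags`, the helper name and the sample response are made up for illustration. Whether the returned score is BM25 depends on the similarity configured for the index.

```python
def tags_from_nearest_neighbour(solr_response):
    """Return (tags, score) of the top-scoring hit, or ([], None) if nothing matched."""
    docs = solr_response.get("response", {}).get("docs", [])
    if not docs:
        return [], None
    best = max(docs, key=lambda d: d["score"])
    return list(best.get("tags", [])), best["score"]


# Hand-written response in the shape of a JSON result requested with fl=id,tags,score.
sample_response = {
    "response": {
        "numFound": 2,
        "docs": [
            {"id": "a", "tags": ["finance", "tax"], "score": 7.4},
            {"id": "b", "tags": ["sports"], "score": 2.1},
        ],
    }
}
tags, score = tags_from_nearest_neighbour(sample_response)
```

Under this reading the new document would not have to be indexed before its neighbor is looked up, since only its text is needed for the query.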

Merging Solr query results through SolrNet

I'm using Solr v3.6.1 and have successfully managed to index data, as well as using Apache Tika to index binary items. I'm using SolrNet to pull this data out. However, I have an issue whereby I want to link 2 results together.
Now consider the following XML (this is just for illustration purposes):
<doc>
<id>263</id>
<title>This is the title</title>
<summary>This is the summary</summary>
<binary_id>994832</binary_id>
</doc>
<doc>
<id>994832</id>
<title>This is the title</title>
<summary>This is the summary</summary>
<text>this is the contents of the binary</text>
</doc>
Is it possible (through SolrNet) to merge the two above results together so when a user searches for This is the contents of the binary it also returns the data in the first item?
In my example you can see the first item contains the id of the binary (994832) so my initial thoughts are that I need to do 2 queries and somehow merge them?
Not really sure about how to do this so any help would be greatly appreciated.

You can try to do something funky with a join kind of query, but beware of the performance impact. Here is my post from some time ago where I was trying to do something similar:
solr grouping based on all index values
Alternatively, a better solution, IF and only IF you can massage the data a bit before indexing, would be to assign the same ID to all documents that need to be retrieved as a group. In your example, this means adding a binary_id field to the second doc and assigning it the value 994832. You could then use Solr grouping to group the items as one, and group sorting to return only the item that you want.
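A rough sketch of the request parameters such a grouped query might use, written in Python only to show the raw parameters (SolrNet would build an equivalent request). The helper name is hypothetical, and the sketch assumes both documents carry a `binary_id` field. Grouping only collapses documents that match the query, so this also assumes the query matches every document in the group.

```python
from urllib.parse import urlencode


def grouped_query_params(user_text, group_field="binary_id"):
    """Build parameters that collapse hits sharing the same group_field value."""
    return {
        "q": user_text,
        "group": "true",             # switch on result grouping
        "group.field": group_field,  # documents with the same value form one group
        "group.limit": 1,            # return a single document per group
        "wt": "json",
    }


params = grouped_query_params("this is the contents of the binary")
query_string = urlencode(params)
```

A `group.sort` parameter could be added to control which document of each group is returned.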

Apache Solr or Lucene proximity search on multiple fields

Is it possible in solr/lucene to search on different multivalued fields?
Imagine having an XML fragment like this:
<normative>
<ref><aut>State</aut><num>70</num><year>2007</year><article>13</article></ref>
<ref><aut>TreasuryMinistry</aut><num>350</num><year>2011</year><article>21</article></ref>
</normative>
Is it possible to retrieve documents containing for instance:
num:70 AND year:2007
inside the same ref ?
i.e. this document should not be found for a query like
num:70 AND year:2011.
I could create catenated fields like
<ref cat='state-0070-2007-0013'/>
<ref cat='TreasuryMinistry-0350-2011-0021'/>
but the user must be able to find by every combination of fields, i.e.
num and year,
year and article,
num and article,
aut and num and year,
on the same ref!
I am not experienced with solr/lucene, so I fear that a wild card search like
cat:'*-0070-2007-*'
might not perform well over our corpus of normative documents.
Is there a way to make a search based on relative position?
Something like using copyField to a multivalued field with a different positionIncrementGap?

This doesn't directly answer your proximity question, but can you treat each <ref> as a document? If so, then a search like 'num:70 AND year:2007' should work fine, assuming you create the 'num' and 'year' fields.
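A small sketch of the "one document per ref" idea, assuming Python on the indexing side. The `parent_id` field, the id scheme and both helper names are hypothetical, and `matches` only imitates an AND query in memory to show why the two example queries behave differently.

```python
import xml.etree.ElementTree as ET

NORMATIVE_XML = """
<normative>
<ref><aut>State</aut><num>70</num><year>2007</year><article>13</article></ref>
<ref><aut>TreasuryMinistry</aut><num>350</num><year>2011</year><article>21</article></ref>
</normative>
"""


def refs_to_docs(xml_text, parent_id):
    """Emit one flat document per <ref>, keeping a pointer to the parent document."""
    root = ET.fromstring(xml_text.strip())
    docs = []
    for i, ref in enumerate(root.iter("ref")):
        doc = {"id": f"{parent_id}-ref{i}", "parent_id": parent_id}
        for child in ref:
            doc[child.tag] = child.text
        docs.append(doc)
    return docs


def matches(doc, **criteria):
    """Imitate an AND query such as num:70 AND year:2007 against one ref document."""
    return all(doc.get(field) == value for field, value in criteria.items())


ref_docs = refs_to_docs(NORMATIVE_XML, parent_id="doc1")
hits_2007 = [d["parent_id"] for d in ref_docs if matches(d, num="70", year="2007")]
hits_2011 = [d["parent_id"] for d in ref_docs if matches(d, num="70", year="2011")]
```

Because each ref is its own document, values from different refs can no longer combine into a false match.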

Combine solr's document score with a static, indexed score

I have people indexed into solr based on documents that they have authored. For simplicity's sake, let's say they have three fields - an integer ID, a Text field and a floating point 'SpecialRank' (a value between 0 and 1 to indicate how great the person is). Relevance matching in solr is all done through the Text field. However, I want my final result list to be a combination of relevance to the query as provided by solr and my own SpecialRank. Namely, I need to re-rank the results based on the following formula:
finalScore = (0.8 * solrScore) + (0.2 * SpecialRank)
As far as I'm aware, this is a common task in information retrieval, as we are just combining two different scores in a weighted manner. The trouble is, I need solrScore to be normalized for this to work. What I have been doing is normalizing the solrScore based on the maxScore for a particular query and re-ranking the results client-side. This has been working OK, but means I have to retrieve all the matching documents from solr before I do my re-ranking.
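The client-side approach just described could look roughly like the following Python sketch. The hit dictionaries and the function name are illustrative; the only inputs assumed are each hit's Solr score and its SpecialRank.

```python
def rerank(hits, relevance_weight=0.8, rank_weight=0.2):
    """Re-rank hits by a weighted sum of max-normalised Solr score and SpecialRank."""
    if not hits:
        return []
    max_score = max(h["score"] for h in hits)
    if max_score <= 0:
        max_score = 1.0  # avoid division by zero when every score is 0
    scored = []
    for h in hits:
        final = relevance_weight * (h["score"] / max_score) + rank_weight * h["SpecialRank"]
        scored.append({**h, "finalScore": final})
    return sorted(scored, key=lambda h: h["finalScore"], reverse=True)


hits = [
    {"id": 1, "score": 4.0, "SpecialRank": 0.1},
    {"id": 2, "score": 3.8, "SpecialRank": 0.9},
    {"id": 3, "score": 1.0, "SpecialRank": 0.5},
]
ranked = rerank(hits)
```

Here document 2 overtakes document 1 because its high SpecialRank outweighs its slightly lower relevance score.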
I am looking for the best way to have solr take care of this re-ranking. Are boost functions able to help here? I have read that they can be multiplicative or additive to the solr score, but since the solr score is not normalized and all over the place depending on different queries, this doesn't really seem to solve my problem. Another approach I have tried is to first query solr for a single document just to get the maxScore, and then use the following formula for the sort:
sum(product(0.8,div(score,maxScore)),product(0.2,SpecialRank))+desc
This, of course, doesn't work as you're unable to use the score as a variable in a sort function.
Am I crazy here? Surely this is a common enough task in IR. I've been banging my head against the wall for a while now, any ideas would be much appreciated.

You could try to implement a custom SearchComponent that goes through the results in Solr and calculates your custom score there. Get the results from the ResponseBuilder (rb.getResults().docSet), iterate through them, add the calculated value to your results and re-sort them.
You can then register your SearchComponent as last in RequestHandler chain:
<arr name="last-components">
<str>elevator</str>
</arr>
More info in the Solr wiki:
http://wiki.apache.org/solr/SearchComponent
Sorry, but no better idea for now.

Is it possible to filter the fields returned in a solr Document?

I have solr documents that look like this:
<doc>
<str name="some_attribute">some_attribute_value</str>
<!-- ... -->
<arr name="locationCoordinates">
<str>48.117,11.539</str>
<str>23.423,11.342</str>
<!-- ... -->
</arr>
</doc>
My question is whether it's possible to filter the returned fields of a document to only return certain values, for example to only return the locationCoordinates that are within a 50 km range of another point and leave the others out.
I.e. return the above document, but with only the first locationCoordinates.
I don't really know whether this should even be possible in Solr (because of the document-oriented structure), but I can at least ask :).
Maybe I should also elaborate on the way I want to use this feature and alternatives I "found" for this:
change the document design to create one document per location (Pro: works, Cons: need to check for duplicates on the client side, heaps of duplicated data in the Solr-database)
leave it with this structure (Pro: works, don't have to change the current structure, Cons: I have to sort out the correct coordinates myself on the client and therefore may run into problems with the distance calculation; I already filter the documents by distance beforehand, and I may lose some data if I compute the distance badly on the client side)
create a new Document "type" for the locations (and their names, etc.) on the Solr-side and use a foreign-key-like structure to add the locations to the articles and in order to compute distances I have to query for reachable locations first and then join on the articles (Pro: everything works on the solr-side, Cons: I will need Solr-Joins for that)
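For the second alternative, filtering the coordinates on the client, a minimal Python sketch using the haversine formula is shown here. The center point and the helper names are made up, and the spherical-earth distance may differ slightly from whatever distance function was used on the Solr side, which is exactly the risk mentioned above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def coordinates_within(coords, center, radius_km):
    """Keep only the "lat,lon" strings that lie within radius_km of center."""
    kept = []
    for value in coords:
        lat, lon = (float(part) for part in value.split(","))
        if haversine_km(lat, lon, center[0], center[1]) <= radius_km:
            kept.append(value)
    return kept


location_coordinates = ["48.117,11.539", "23.423,11.342"]
nearby = coordinates_within(location_coordinates, center=(48.137, 11.575), radius_km=50)
```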

"change the document design to create one document per location (Pro: works, Cons: need to check for duplicates on the client side, heaps of duplicated data in the Solr-database)"
If this is your only concern, you could consider making locationCoordinates the unique key of each document.
By doing this, you do not allow duplicate locationCoordinates in your index, which eliminates the need to check for duplicates on the client side.
