Is the Solr Query possible with function value comparison? - solr

I have been working with Solr for 3-4 months. I want to know whether it is possible to write a Solr query with the following requirements.
return all the documents where,
fieldName1 = queryTerm1 &
strdist(queryTerm2, fieldName2, JW) > 5 (or some constant)
If this is possible, what will be the query?

I guess you can get close.
Sort the results on string distance (the URL is split across lines for readability):
localhost:8983/solr/select/?fl=id
&q=fieldName1:queryTerm1
&sort=strdist("queryTerm2",fieldName2, JW) desc
which will order the results with the highest string distance first.
Note that you cannot directly get the string distance. There is a pseudo-field score, retrieved by:
fl=id,score
but it means nothing in an absolute sense.
You can also boost results based on the string distance, instead of simply sorting them. In this case, it will look at the relevancy of the document as well as the string distance.
Once you have a sorted list (hopefully it's not too large!), you can determine client-side which elements have 'string distance < 5'.
I made this up from the links below.
http://yonik.wordpress.com/2011/03/10/solr-relevancy-function-queries/
http://wiki.apache.org/solr/FunctionQuery#strdist
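A minimal sketch of that client-side filtering step, assuming the matching documents have already been fetched. Note the assumption: difflib's ratio is used here as a stand-in similarity measure, not Solr's Jaro-Winkler strdist, so the threshold values won't match exactly.

```python
import difflib

def filter_by_similarity(docs, query_term, field, threshold=0.5):
    """Keep only docs whose field value is similar enough to query_term.

    Uses difflib's ratio as a stand-in for Solr's Jaro-Winkler
    strdist; swap in a real Jaro-Winkler implementation if the
    scores need to match what Solr sorted by.
    """
    def sim(a, b):
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return [d for d in docs if sim(d[field], query_term) >= threshold]

docs = [{"id": "1", "name": "solar"}, {"id": "2", "name": "mysql"}]
print(filter_by_similarity(docs, "solr", "name", threshold=0.6))
```

Since the sort already puts the closest matches first, the client can also just stop iterating once the similarity drops below the threshold.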

As far as I know, it's not possible.

Related

Apache Solr's bizarre search relevancy rankings

I'm using Apache Solr to search some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of the 4 results is a document containing only 2 of those words, multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words should rank higher than a document that has only two of those words (repeated more frequently).
What Solr is doing here is correct: it scores documents with the TF-IDF algorithm, so the order in your case is explained by that formula.
One possible solution is to ignore the TF-IDF score and count each hit in a document as exactly one; then a document with 5 matches gets score 5, one with 4 matches gets score 4, and so on. A constant-score query can do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
Another solution, which requires some scripting, works like this: first run a query that demands all 5 terms, e.g. +Julian +Cribb +EPA +peak +oil; then do the same for each combination of 4 terms out of 5 (5 additional queries, if I'm not mistaken), and so on down to a single mandatory clause. You then have the full results and only need to normalise or concatenate them, if you decide that 5-match documents always beat 4-match documents. Cons of this solution: a lot of queries, which need to be run programmatically (a script would help), and the normalisation isn't obvious. Pros: you keep both TF-IDF and the idea of matched terms.
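The combinatorial query generation described above can be sketched as follows (the term list is taken from the example query; each subset becomes one query with all its terms mandatory):

```python
from itertools import combinations

terms = ["Julian", "Cribb", "EPA", "peak", "oil"]

def match_queries(terms):
    """Build one Solr query string per term subset, from all terms
    mandatory down to a single mandatory clause."""
    queries = []
    for k in range(len(terms), 0, -1):
        for combo in combinations(terms, k):
            queries.append(" ".join("+" + t for t in combo))
    return queries

qs = match_queries(terms)
print(qs[0])    # '+Julian +Cribb +EPA +peak +oil'
print(len(qs))  # 31 subsets in total
```

For 5 terms this yields 2^5 - 1 = 31 queries, which is why the answer warns about the number of queries involved.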

How can I sort facets by their tf-idf score, rather than popularity?

For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, the TF should be calculated using all the documents that participate in the results list.
I assume that the only problem with this approach would be when no query is made, i.e., when one searches for "*:*". Then, no term will prevail over the others in terms of interestingness. Please correct me if I am wrong here.
Anyway, is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first)
index - return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, index otherwise.
Prior to Solr 1.4, one needed to use true instead of count and false instead of index.
This parameter can be specified on a per-field basis.
It looks like you couldn't do it out of the box without some serious changes on client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness score can be computed on the client side: for a given result set based on a filter, we can exclude that filter for the facet using the local-params syntax (!tag & !ex). On the client side, we can then compute the counts relative to the complete index (or to another filtered subset). This would probably not work for result sets built by a query parameter.
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within solr as a new option for facet.sort, because the information needed is easily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the larger flexibility of facet.json, e.g. sorting on stats-facets (e.g. avg(price)) of another field, I guess this could be implemented as an additional sort-option. At least for facets of type term, the result-count (df for current result-set) only needs to be divided by the df of that term for the index (docfreq). If the current result-set is the complete index, facets should be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cached query on the complete index. However, for term-fields and similar this might not scale.
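The client-side scoring described above can be sketched like this. Note the assumptions: `interestingness` is a hypothetical helper, the facet counts would come from a faceted query over the current result set, and the index-wide document frequencies from the terms component.

```python
def interestingness(facet_counts, index_doc_freq):
    """Score each facet term by relative frequency lift:
    (count in current result set) / (doc freq in the whole index).

    facet_counts:   term -> count within the current result set
    index_doc_freq: term -> document frequency over the full index
    Common words ("the") have a high df, so their lift stays low.
    """
    return sorted(
        ((term, count / index_doc_freq[term])
         for term, count in facet_counts.items()
         if index_doc_freq.get(term)),
        key=lambda pair: pair[1],
        reverse=True)

facets = {"the": 900, "solr": 40, "faceting": 25}
dfs = {"the": 100000, "solr": 80, "faceting": 30}
print(interestingness(facets, dfs))
```

As noted above, when the result set is the whole index every lift approaches the same baseline, so sorting by plain count is the sensible fallback there.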

Solr non sorted results order

I was wondering whether I can count on the order of Solr results when the queries are not sorted.
For example:
Let's assume there are 100 documents and I want to provide paging by running 10 queries of 10 docs each, where I increment the start position each time.
If I run *:* 10 times while incrementing the start position by 10 each time, can I assume I'll get all 100 docs, or, since there is no sorting, will I get a different random 10 documents each time?
I know that in SQL databases this won't work; I was wondering if Solr is different.
Solr is different. If you do not specify a sort order, it will sort by score. If scores are equal, it sorts by the moment a document got indexed.
In case you query for *:* (MatchAllDocsQuery), all documents have the same score and will be returned in the order they got indexed, as described in the SO question How are results ordered in solr in a "match all docs" query.
When would the order change? When one of the documents gets updated: the update re-indexes it, so it moves behind its older siblings.
But Solr has a RandomSortField for this matter:
Utility Field used for random sorting. It should not be passed a value.
This random sorting implementation uses the dynamic field name to set the random 'seed'. To get random sorting order, you need to use a random dynamic field name.
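The paging scheme from the question can be sketched as a parameter generator. The assumption, per the answer above, is a stable index with no concurrent updates; otherwise a re-indexed document can shift between pages.

```python
def page_params(total, page_size):
    """Yield one Solr request parameter set per page of results."""
    for start in range(0, total, page_size):
        yield {"q": "*:*", "start": start, "rows": page_size}

pages = list(page_params(100, 10))
print(len(pages))         # 10 pages
print(pages[3]["start"])  # 30
```

With an unchanged index, issuing these 10 requests returns each of the 100 documents exactly once.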

Sort by score & an int field value

I need to sort documents returned in order of (descending) score and (descending) value of an int field within the document. How do I ensure proper sort order as well as good performance?
I don't need the sort-order defined by sort=score desc,intField desc.
The sort order needs to be roughly what you would get by using the product score*fieldVal as the effective score. I don't need the exact product for sorting; approximations are OK, and this is just to roughly define the sort order I need.
I can see a few possible ways to accomplish this:
1. Use a customized function of score for sort
2. Use query time boost to increase the score using the int field value for boost
I'm new to Solr and don't understand the performance implications of each of the cases above. Also, I don't know if there are other, better ways to accomplish what I'm trying to do. So how do I build a performance-friendly query to achieve this sort order?
Have a look at solr function queries
https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-SortByFunction
Example (a sort on a function needs an explicit direction, and inside a function the relevancy score is referenced via query($q) rather than score):
&sort=product(query($q),fieldVal) desc
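Both approaches from the question can be expressed as plain request parameters. A sketch, where intField stands in for the actual integer field name, and the second option uses edismax's multiplicative boost parameter:

```python
from urllib.parse import urlencode

# Option 1: sort by a function of the field. query($q) stands in
# for the relevancy score (score itself cannot be used inside a
# function), and a function sort needs an explicit direction.
sort_params = {
    "q": "some query",
    "sort": "product(query($q),intField) desc",
    "fl": "id,score",
}

# Option 2: multiplicative boost via the edismax parser, so the
# effective score becomes relevancy * intField.
boost_params = {
    "q": "some query",
    "defType": "edismax",
    "boost": "intField",
}

print(urlencode(boost_params))
```

The boost variant keeps the default score-based sort, which generally plays better with Solr's caches than a custom function sort over every matching document.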

How can I find the closest document using Google App Engine Search API?

I have approximately 400,000 documents in a GAE Search index. All documents have a location GeoPoint property and are spread over the entire globe. Some documents might be over 4000km away from any other document, others might be bunched within meters of each other.
I would like to find the closest document to a specific set of coordinates but find the following code gives incorrect results:
from google.appengine.api import search

# coords are in the form of a tuple e.g. (50.123, 1.123)
search.Document(
    doc_id='meaningful-unique-id',
    fields=[search.GeoField(name='location',
                            value=search.GeoPoint(coords[0], coords[1]))])

# find_document radius is in metres
def find_document(coords, radius=1000000):
    sort_expr = search.SortExpression(
        expression='distance(location, geopoint(%.3f, %.3f))' % coords,
        direction=search.SortExpression.ASCENDING,
        default_value=0)
    search_query = search.Query(
        query_string='distance(location, geopoint(%.3f, %.3f)) < %d'
                     % (coords[0], coords[1], radius),
        options=search.QueryOptions(
            limit=1,
            ids_only=True,
            sort_options=search.SortOptions(expressions=[sort_expr])))
    index = search.Index(name='document-index')
    return index.search(search_query)
With this code I will get results that are consistent but incorrect. For example, a search for the nearest document to London indicated the closest one was in Scotland. I have verified that there are thousands of closer documents.
I narrowed the problem down to the radius parameter being too large. I get correct results if the radius is down to around 12km (radius=12000). There are generally no more than 1000 documents in a 12 km radius. (Probably associated with search.SortOptions(limit=1000).)
The problem is that if I am in a sparse area of the globe where there aren't any documents for thousands of miles, my search function will not return anything with radius=12000 (12km). I want it to return the closest document to me wherever I am. How can I accomplish this consistently with one call to the Search API?
I believe the issue is the following.
Your query will select up to 10K documents, then those are sorted according to your distance sort expression and returned. (That is, the sort is in fact not over all 400k documents.)
So I suspect that some of the geographically closer points are not included in this 10k selection.
That's why things work better when you narrow your search radius, as you have fewer total points in that radius.
Essentially, you want to get your query 'hits' down to 10k, in a manner that makes sense for what you are querying on.
You can address this in at least a couple of ways, which you can combine:
Add a ranking, so that the most 'important' docs (by some criteria that makes sense in your domain) are returned in rank order, then these will be sorted by distance.
Filter on one or more document field(s) (e.g., 'business category', if your docs contain information about businesses) to reduce the number of candidate docs.
(I don't believe this 10k threshold is currently in the Search API documentation; I've filed a ticket to get it added).
I have the exact same problem, and I don't think it's possible. The problem happens, as you yourself have figured out, when there are more possible results than returned results. The Google algorithm just quits when it has hit its limit and then sorts the results.
I have seen the same clusters as you, and it's part of the Search API.
One hack would be to subdivide your search into sub-sectors, do multiple simultaneous calls, and then merge and order the results.
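Another workaround, not from the Search API itself but building on the radius observation in the question: retry the bounded query with an expanding radius until something comes back. Here search_fn is a hypothetical wrapper around find_document from the question.

```python
def find_nearest(search_fn, coords, initial_radius=12000,
                 max_radius=20_000_000):
    """Widen the search radius geometrically until the bounded
    query returns at least one document, or the whole globe
    (~20,000 km) has been covered."""
    radius = initial_radius
    while radius <= max_radius:
        results = search_fn(coords, radius)
        if results:
            return results[0]
        radius *= 4
    return None

# Usage with a fake search function standing in for the API call:
fake = lambda coords, r: ["doc-42"] if r >= 200000 else []
print(find_nearest(fake, (51.5, -0.1)))  # doc-42
```

Each retry stays under the ~10k-hit threshold described above as long as the starting radius is small, at the cost of extra API calls in sparse regions.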
Wild idea: why not record the distance from 3 fixed reference points, then calculate the nearest document from those?
