Solr 5 search not displaying results based on score

I am implementing Solr search, and the results are not ordered by score the way I expect. Let's say I use the search keywords .net ios. I have a field KeySkills which holds the following data:
KeySkills:Android, ios, Phonegap, ios
KeySkills:.net, .net, .net, MVC, HTML, CSS
When I search with .net ios as the keywords, the document with .net, .net, .net, MVC, HTML, CSS should come first in the results and score higher, because it contains .net three times, but I am getting the reverse result.
Is there any setting needed in the Solr config file or in schema.xml to achieve this, or how can I sort the results by the number of occurrences of the search terms? Please help me solve this.
Following is the result I get:
{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "indent": "true",
      "q": ".net ios",
      "_": "1434345788751",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "KeySkills": "Android, ios, Phonegap, ios",
        "_version_": 1504020323727573000,
        "score": 0.47567564
      },
      {
        "KeySkills": "net, net, net, MVC, HTML, CSS",
        "_version_": 1504020323675144200,
        "score": 0.4726259
      }
    ]
  }
}

As you can see in Lucene's documentation, the score is not estimated from the number of matching terms alone:
score(q,d) = coord(q,d) · queryNorm(q) · ∑ over t in q ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
Where:
- tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d.
- idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms give a higher contribution to the total score.
- coord(q,d) is a score factor based on how many of the query terms are found in the specified document.
- t.getBoost() is a search-time boost of term t in the query q, as specified in the query text.
- norm(t,d) encapsulates a few (indexing-time) boost and length factors:
  - the field boost
  - lengthNorm, computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.
When a document is added to the index, all the above factors are
multiplied. If the document has multiple fields with the same name,
all their boosts are multiplied together:
norm(t,d) = lengthNorm · ∏ f.boost()
So here I guess that "KeySkills": "Android, ios, Phonegap, ios" comes before your other document because it contains fewer words than the other one.
To check that, you can use this awesome tool, which is explain.solr.pl.
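You can also ask Solr itself for the per-document score breakdown by adding debugQuery=true; this explain output is what explain.solr.pl visualizes. A sketch, assuming a core named collection1:
http://localhost:8983/solr/collection1/select?q=.net%20ios&fl=KeySkills,score&wt=json&debugQuery=true
If you want term frequency to outweigh field length here, one common option is to set omitNorms="true" on the KeySkills field in schema.xml and reindex, which disables the lengthNorm factor described above.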

Related

How to display the individual score for search results in solr?

I am testing Solr 9.0 with this tutorial:
https://solr.apache.org/guide/solr/latest/getting-started/tutorial-techproducts.html
I used this query:
http://localhost:8983/solr/techproducts/select?q=cat:electronics&fl=name
In the results displayed, it only gives a maxScore. How do I display the individual score for each result?
"response": {
"numFound": 12,
"start": 0,
"maxScore": 0.5244061,
"numFoundExact": true,
"docs": [
{
"name": "Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133"
},
{
"name": "Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300"
},
{
"name": "Belkin Mobile Power Cord for iPod w/ Dock"
},
You can read individual document scores as an additional field in the results via the fl (Field List) parameter.
The fl parameter limits the information included in a query response
to a specified list of fields. The fields must be either stored="true"
or docValues="true".
The field list can be specified as a space-separated or
comma-separated list of field names. The string score can be used to
indicate that the score of each document for the particular query
should be returned as a field. The wildcard character * selects all
stored fields in the document.
http://localhost:8983/solr/techproducts/select?q=cat:electronics&fl=name,score
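With score in the field list, each document in docs then carries its own score field; the first hit of a default relevance-sorted result will match maxScore. Illustrative shape only, not real output:
"docs": [
  {
    "name": "Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133",
    "score": 0.5244061
  },
  ...
]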

How to customize tokenization of numbers by the en.microsoft analyzer?

We are now using the Azure Search Microsoft language analyzers on some language-specific fields. In most cases they give better relevance than the standard Lucene language analyzers, but we found an issue when verifying the en.microsoft analyzer.
The problem is that if the field value contains digits, the analyzer is smart enough to allow redundant “0”s in front of the digit.
For example:
POST /analyze?api-version=2017-11-11
{
  "text": "1",
  "analyzer": "en.microsoft"
}
The response is:
"tokens": [
  {
    "token": "1",
    "startOffset": 0,
    "endOffset": 2,
    "position": 0
  },
  {
    "token": "nn1",
    "startOffset": 0,
    "endOffset": 2,
    "position": 0
  }
]
The problem is that if the field value is “01”, then any text like “01”, “001”, “0001”, … will match that field.
We have a field that saves product attribute name/value pairs, for example “brand:Contoso|size:1”. Even searching for “0001” can then return the document with this field value. This is not what we want.
So my question is: is there any way to customize the en.microsoft analyzer so that we can take advantage of its powerful stemmer but avoid the automatic “0” padding in front of digits?
Unfortunately, you can't change how the Microsoft tokenizers normalize numbers. To work around this limitation, you could choose a different analyzer for product attributes, or add a character filter to your analyzer configuration that encodes the numeric characters so the tokenizer ignores them; for example, map each digit into a character from outside your expected character set using the MappingCharFilter. You can find examples here; use MicrosoftLanguageStemmingTokenizer as your tokenizer.
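A minimal sketch of that workaround as part of an index definition, assuming the Azure Search REST API custom-analyzer syntax; the names digit_shift, attr_tokenizer, and attr_analyzer are placeholders, and each digit is mapped into the Unicode private-use area so the tokenizer leaves it alone:
"charFilters": [
  {
    "name": "digit_shift",
    "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
    "mappings": ["0=>\uE000", "1=>\uE001", "2=>\uE002", "3=>\uE003", "4=>\uE004",
                 "5=>\uE005", "6=>\uE006", "7=>\uE007", "8=>\uE008", "9=>\uE009"]
  }
],
"tokenizers": [
  {
    "name": "attr_tokenizer",
    "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
    "language": "english"
  }
],
"analyzers": [
  {
    "name": "attr_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "charFilters": ["digit_shift"],
    "tokenizer": "attr_tokenizer",
    "tokenFilters": ["lowercase"]
  }
]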

Does it make sense to model a Solr document if search does not allow to specify the attributes?

I want to provide a search feature in my site where the user is able to search by text only, without specifying the attributes.
For example, instead of allowing the user to search by "author=George Martin" he will simply query "George Martin".
I would like to know if there is any advantage in a document model like this one:
{
  "id": 1,
  "title": "Game of Thrones",
  "author": "George R. R. Martin",
  "published": "August, 1996"
}
Compared to:
{
  "id": 1,
  "data": [
    "Game of Thrones",
    "George R. R. Martin",
    "August, 1996"
  ]
}
If I'm not going to use "author:value" in the Solr API, I should get the same results, right?
The first version will allow you to assign different weights to the different fields. I.e. a hit in the title might be more important than a hit in the author field - or vice versa.
Using the edismax handler (defType=edismax) and query fields (qf=title author published) will give you the same behavior as your second example, but will retain the structure of the document.
As the fields are put into the qf parameter, there is no need for the user to explicitly tell Solr which fields she wants to search.
To give the fields different weights, assign a weight to the field in the qf list: qf=title^5 author^2 published will give a hit in title five times the weight of a hit in published - i.e. "The Hunt for Red October" will be more important than something published in October.
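Putting that together, a sketch of such a request (the collection name books is a placeholder):
http://localhost:8983/solr/books/select?defType=edismax&q=George+Martin&qf=title%5E5+author%5E2+published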

Solr 7 - How to do Full Text Search w/ Geo Spatial Search

How to do Full Text combined w/ Geo Spatial in Solr 7?
In regards to this: https://lucene.apache.org/solr/guide/7_2/spatial-search.html
I have to do queries that COMBINE full text w/ geo spatial. For example:
box AND full text or spatial sort AND full text.
I was not able to figure out a good query string example that produces the desired result. I would like this as a pure query string rather than some Java method, as I'm consuming this from tech other than Java. Solr is very deep and confusing, and I know I must read more, but I found no good examples for this anywhere online.
desired query string example
[solr]/select?fq={!bbox sfield=point}&pt=34.04506799999999,-118.260849&d=10000&sort=geodist() asc&{!geofilt}&sfield=point&q=Panini
So in that case, it would sort by distance yet also filter by some full-text value.
If this cannot be done, I believe it is possible in Elasticsearch, but both Solr and Elasticsearch are built on top of Lucene, so it seems like it should work on both if it works on one; feel free to supply an answer for Elasticsearch as well.
example returned
{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "q": "Panini",
      "pt": "34.04506799999999,-118.260849",
      "d": "10000",
      "{!geofilt}": "",
      "fq": "{!bbox sfield=point}",
      "sort": "geodist() asc",
      "sfield": "point"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}
Docs do contain the phrase 'Panini', but none are returned. Maybe it's due to the default way full text is handled in Solr 7? The query uses the same point where the term 'Panini' occurs, and the field point is of type org.apache.solr.schema.PointType.
UPDATE
I ended up abandoning Solr for Elastic Search. Solr is just very annoying in its strange ways compared with the very easy to use Elastic Search. Things just work as you expect without having to dig into quirks.
I adapted my answer to the solr 7.2.1 example:
Start solr by: ./bin/solr start -e techproducts
I've also visualized the data in google maps:
https://www.google.com/maps/d/u/0/viewer?ll=42.00542239270033%2C-89.81213734375001&hl=en&hl=en&z=4&mid=16gaLvuWdE9TsnhcbK-BMu5DVYMzR9Vir
You need these query parameters:
- Bound-by-box filter: fq={!bbox}
- The geo filter query parser bbox needs further parameters:
  - Solr field: sfield=store
  - Point to search/sort from: pt=36.35,-97.51
  - Distance for filter: d=1200
- Sort: sort=geodist() asc
- Full-text query: q=some+text
Full example queries for solr example data:
Simple:
http://localhost:8983/solr/techproducts/select?fq={!bbox}&sort=geodist()%20asc&sfield=store&pt=36.35,-97.51&d=1200&q=ipod&fl=name,store
UI:
http://localhost:8983/solr/techproducts/browse?fq={!bbox}&sort=geodist()%20asc&sfield=store&pt=36.35,-97.51&d=1200&q=ipod
The result is as expected:
Apple 60 GB iPod
Belkin Power Cord for iPod
Filtered out by distance: iPod & iPod Mini USB 2.0 Cable
Hints
- The field store must be of type location.
- You might need to URL-encode the special characters, e.g. fq=%7B%21bbox%20sfield%3DgeoLocation%7D
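For reference, a minimal sketch of such a field in schema.xml, assuming the LatLonPointSpatialField type recommended for recent Solr versions (docValues is needed for sorting by geodist()):
<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>
<field name="store" type="location" indexed="true" stored="true"/>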
In your case, you have to combine the full-text search scoring with the spatial distance.
So if your query looks like this:
/select?fq={!bbox sfield=point}&pt=34.04506799999999,-118.260849&d=10000&sort=geodist() asc&{!geofilt}&sfield=point&q=Panini
You should change the sort parameter and either remove it or just set it to score desc. That way you sort by the score given from the full-text search query.
To take the spatial part into consideration, you need to include a boosting function in your query. In the majority of cases, the closer the document is to the point of interest the better, so you would probably want a boosting function of the form X/distance. The X can be as simple as 1, and the function itself can be more complicated. To do that in a dismax query you would use the bf parameter, like bf=div(1,geodist()).
Try that out, it should work, but of course will need some adjustments.
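A sketch of the combined request along those lines (edismax enables the bf parameter; the field and point are taken from your example):
/select?defType=edismax&q=Panini&fq={!bbox}&sfield=point&pt=34.04506799999999,-118.260849&d=10000&bf=div(1,geodist())&sort=score desc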

Lucene/Solr: Store offset information for certain keywords

We are using Solr to store documents with keywords; each keyword is associated with a span within the document.
The keywords were produced by some fancy analytics and/or manual work prior to loading them into Solr. A keyword can be repeated multiple times in a document. On the other hand, different instances of the same string in a single document can be connected with different keywords.
For example, this document
Bill studied The Bill of Rights last summer.
could be accompanied by the following keywords (with offsets in parentheses):
William Brown (0:4)
legal term (13:31)
summer 2011 (32:43)
(Obviously in other documents, Bill could refer to Bill Clinton or Bill Gates. Similarly, last summer will refer to different years in different documents. We do have all this information for all the documents.)
I know the document can have a field, say KEYWORD, which will store William Brown. Then when I search for William Brown I will get the above document. That part is easy.
But I have no idea how to store the info that William Brown corresponds to the text span 0:4 so I can highlight the first Bill, but not the second.
I thought I could use TermVectors, but I am not sure if/how I can store custom offsets. I would think this is a fairly common scenario ...
EDIT: edited to make clear that Bill can refer to different people/things in different documents.
EDIT2: edited to make clear that a document can contain homonyms (identical strings with different meanings).
Two Q Monte
Solution Pros:
- Annotations logically stored with source docs
- No knowledge of highlighter implementation or custom Java highlighter development required
- Since all customization happens outside of Solr, this solution should be forward-compatible with future Solr versions
Solution Cons:
- Requires two queries to be run
- Requires code in your search client to merge results from one query into the other
With Solr 4.8+ you can nest child documents (annotations) underneath each primary document (text)...
curl http://localhost:8983/solr/update/json?softCommit=true -H 'Content-type:application/json' -d '
[
  {
    "id": "123",
    "text": "Bill studied The Bill of Rights last summer.",
    "content_type": "source",
    "_childDocuments_": [
      {
        "id": "123-1",
        "content_type": "source_annotation",
        "annotation": "William Brown",
        "start_offset": 0,
        "end_offset": 4
      },
      {
        "id": "123-2",
        "content_type": "source_annotation",
        "annotation": "legal term",
        "start_offset": 13,
        "end_offset": 31
      },
      {
        "id": "123-3",
        "content_type": "source_annotation",
        "annotation": "summer 2011",
        "start_offset": 32,
        "end_offset": 43
      }
    ]
  }
]'
... using block join to query the annotations.
1) Annotation Query: http://localhost:8983/solr/query?fl=id,start_offset,end_offset&q={!child of=content_type:source}annotation:"William Brown"
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
}
]
}
Store these results in your code so that you can fold in the annotation offsets after the next query returns.
2) Source Query + Highlighting: http://localhost:8983/solr/query?hl=true&hl.fl=text&fq=content_type:source&q=text:"William Brown" OR id:123
(id:123 discovered in Annotation Query gets ORed into second query)
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123",
"content_type": "source",
"text": "Bill studied The Bill of Rights last summer."
}
],
"highlighting":{}
}
Note: In this example there is no highlighting information returned because the search terms didn't match any content_type:source documents. However we have the explicit annotations and offsets from the first query!
Your client code then needs to take the content_type:source_annotation results from the first query and manually insert highlighting markers into the content_type:source results from the second query.
More block join info on Yonik's blog here.
By default, Solr stores the start/end position of each token once the text is tokenized, for instance using the StandardTokenizer. This info is encoded in the underlying index. The use case that you described here sounds a lot like the SynonymFilterFactory.
When you define a synonym using the SynonymFilterFactory, stating for instance that foo is equivalent to bar, the bar term is added to the token stream generated when the text is tokenized, and it will have the same offset information as the original token. So for instance, if your text is "foo is awesome", the term foo will have the offset information (start=0, end=3), and a new token bar (start=0, end=3) will be added to your index (assuming that you're using the SynonymFilterFactory at index time):
text:  foo is awesome
start: 0   4  7
end:   3   6  13
Once the SynonymFilterFactory is applied:
       bar
text:  foo is awesome
start: 0   4  7
end:   3   6  13
So if you fire a query using foo, the document will match; and if you use bar as your query, the document will also match, since a bar token was added by the SynonymFilterFactory.
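For reference, a minimal sketch of such an index-time analyzer chain in schema.xml; the type name text_syn is a placeholder, and synonyms.txt would contain a line like foo,bar:
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>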
In your particular case, you're trying to accomplish multi-term synonyms, which is a difficult problem; you may need something more than the default synonym filter of Solr. Check this post from the guys at OpenSourceConnections and this other post from Lucidworks (the company behind Solr/Lucene). These two posts provide additional information and the caveats of each approach.
Do you need to fetch the stored offsets for some later processing?
