Assume a list of books with an Author field. How might one facet on the Author field, but treat the values "Stephen King" and "Richard Bachman" as the same? So that these results:
Hemmingway: 8
Stephen King: 10
Edgar Allan Poe: 20
Richard Bachman: 5
Would be displayed as:
Hemmingway: 8
Stephen King: 15
Edgar Allan Poe: 20
Note that it is unimportant if the facet title is "Stephen King", "Richard Bachman", or something else. It is only important that they are faceted together.
Note that a query-time solution is needed. Unfortunately, the schema cannot be changed for this index: it is a general-purpose index, and if every user could make their own schema 'tweak' it would get out of hand.
You can achieve that by combining facet fields with facet queries.
Add these to your query:
&facet=true
&facet.field=author
&facet.query=author:("Hemmingway" OR "Stephen King")
Facets returned will look like this:
facet_counts: {
facet_queries: {
"author:("Hemmingway" OR "Stephen King")" : 18
}
facet_fields: {
author: {
"Hemmingway" : 8,
"Stephen King" : 10,
"Edgar Allan Poe" : 20,
"Richard Bachman" : 5
}
}
}
You can also add an 'alias' to the facet query. Change this
&facet.query=author:("Hemmingway" OR "Stephen King")
To
&facet.query={!ex=dt key="Hemmingway"}author:("Hemmingway" OR "Stephen King")
And the facet query output will be:
facet_queries: {
"Hemmingway" : 18
}
I'm not sure whether Solr itself can merge the two output sections (facet_queries and facet_fields), but doing that in any client should be straightforward.
You need an analysis chain that converts the strings. I think SynonymFilter will do this for you if you apply it at both index time and query time. You would need to make sure the synonym mapping goes one way only.
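A minimal client-side sketch of that merge, in plain Python. It assumes the facet query was labelled key="Stephen King" and covers author:("Stephen King" OR "Richard Bachman"), so its count is 15 with the numbers from the question; merge_facets and the groups mapping are hypothetical names, not part of Solr.

```python
def merge_facets(field_counts, query_counts, groups):
    """Fold labelled facet-query counts into the per-value facet counts.

    groups maps a facet-query label to the field values it covers
    (a mapping the client has to maintain itself).
    """
    merged = dict(field_counts)
    for label, members in groups.items():
        if label not in query_counts:
            continue  # the facet query was not part of this response
        for author in members:
            merged.pop(author, None)  # drop the individual entries
        merged[label] = query_counts[label]  # add the combined entry
    return merged


field_counts = {
    "Hemmingway": 8,
    "Stephen King": 10,
    "Edgar Allan Poe": 20,
    "Richard Bachman": 5,
}
query_counts = {"Stephen King": 15}  # assumed facet_queries output
groups = {"Stephen King": ["Stephen King", "Richard Bachman"]}

merged = merge_facets(field_counts, query_counts, groups)
print(merged)
```

Using the facet-query count rather than adding the two field counts keeps the number right even if a book lists both names in a multi-valued author field.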
I assume you do not need the whole list of facets, just the top n authors. If that is the case, you can do it in a post-processing step.
You know your synonyms, so if you request a somewhat higher facet.limit (say 2*n), you only have to filter the synonyms out of the result set. If you end up with fewer than n results, repeat the previous step with a higher limit (in the worst case you need one or more extra requests, depending on the number of synonyms).
For example: ...&facet=true&facet.field=author&facet.limit=100&facet.mincount=1
This approach has nothing to do with Solr itself, but considering all the restrictions it might just do the job.
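The post-processing step could look roughly like the sketch below. It folds each known synonym into one canonical name and keeps the top n entries; top_authors and the canonical mapping are hypothetical, and the re-query loop for the case where fewer than n results remain is left out.

```python
from collections import Counter


def top_authors(facet_pairs, canonical, n):
    """Sum counts of synonymous authors and return the n largest.

    facet_pairs: (author, count) pairs as returned for facet.field=author
    canonical:   maps a synonym to the name it should be counted under
    """
    totals = Counter()
    for author, count in facet_pairs:
        totals[canonical.get(author, author)] += count
    return totals.most_common(n)


facet_pairs = [
    ("Edgar Allan Poe", 20),
    ("Stephen King", 10),
    ("Hemmingway", 8),
    ("Richard Bachman", 5),
]
canonical = {"Richard Bachman": "Stephen King"}

top_two = top_authors(facet_pairs, canonical, 2)
print(top_two)
```

Note that summing field counts like this double-counts a book that carries both names, which is an accepted limitation of this sketch.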
Related
I have people indexed into solr based on structured documents. For simplicity's sake, let's say they have the following schema
{
personName: text,
games :[ { gamerScore: int, game: text } ]
}
An example of the above would be
{
personName: john,
games: [
{ gamerScore: 80, game: Zelda },
{ gamerScore: 20, game: Space Invader },
{ gamerScore: 60, game: Tetris},
]
}
'gamerScore' is a value between 1 and 100 that indicates how good the person is at the specified game.
Relevance matching in solr is all done through the Text field 'game'. However, I want my final result list to be a combination of relevance to the query as provided by solr and my own gamerScore. Namely, I need to re-rank the results based on the following formula:
personFinalScore = (0.8 * solrScore) + (0.2 * gamerScore)
What I am trying to achieve is the combination of two different scores in a weighted manner in Solr. This question was asked a long time ago, and I was wondering if there is something in Solr v7.x that can tackle this.
I can change the schema around if a solution requires it.
In effect, your formula can be simplified to adding gamerScore with a weight of 0.25 (that is, 0.2 / 0.8). The absolute value of the score is irrelevant; what matters is how much the gamerScore field affects the score of a document relative to the other documents.
The dismax-based handlers support bf:
The bf parameter specifies functions (with optional boosts) that will
be used to construct FunctionQueries which will be added to the user’s
main query as optional clauses that will influence the score.
Since bf is an additive boost, you can use bf=product(gamerScore,0.25) to make the gamerScore count for 20% of the total score.
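A small sketch of why the two forms rank documents identically: 0.8 * solrScore + 0.2 * gamerScore is just 0.8 times (solrScore + 0.25 * gamerScore), and a constant factor does not change the order. The scores below are made-up illustration values, and the sketch assumes gamerScore is available as a single numeric value per scored document.

```python
def weighted(solr_score, gamer_score):
    # Formula from the question.
    return 0.8 * solr_score + 0.2 * gamer_score


def additive(solr_score, gamer_score):
    # What q + bf=product(gamerScore,0.25) amounts to.
    return solr_score + 0.25 * gamer_score


def ranking(score_fn, docs):
    # Document names ordered by descending score.
    return sorted(docs, key=lambda name: score_fn(*docs[name]), reverse=True)


# Hypothetical (solrScore, gamerScore) pairs.
docs = {"john": (3.2, 80), "mary": (4.0, 20), "alex": (1.5, 60)}

rank_weighted = ranking(weighted, docs)
rank_additive = ranking(additive, docs)
print(rank_weighted, rank_additive)
```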
Solr experts -
I have two records in my solr database:
{
"keywords":["jaime kelly jkelly natixis sales and marketing manager"],
"job_role":"natixis sales & marketing manager",
"empl_name":["jaime kelly jkelly"],
},
{
"keywords":["schwayb jackson sjackson"],
"job_role":"portfolio manager",
"empl_name":["schwayb jackson sjackson"],
}
When I search on the field empl_name with the query:
empl_name:schwayb natixis
the first record returned is jaime kelly instead of schwayb jackson. This is weird. I am explicitly searching the field empl_name and among the two records the second one is the closer match. Why does Solr not order it correctly?
Looks like Solr sees the string "natixis" in the job_role and keywords of the earlier record and is giving it more preference. But I want solr to ONLY look at empl_name and no other field. How can I achieve this?
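You need to group the query terms in parentheses: empl_name:(schwayb natixis). Otherwise the query is parsed as empl_name:schwayb OR natixis, and natixis is searched in the default field rather than in empl_name.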
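If the query string is assembled in client code, the grouping can be applied in one place. This is a minimal sketch with a hypothetical helper, field_query; it does not escape special query characters, which real input would need.

```python
def field_query(field, terms):
    """Scope every term to one field by grouping them in parentheses."""
    return f"{field}:({' '.join(terms)})"


query = field_query("empl_name", ["schwayb", "natixis"])
print(query)
```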
We are using Solr to store documents with keywords; each keyword is associated with a span within the document.
The keywords were produced by some fancy analytics and/or manual work prior to loading them into Solr. A keyword can be repeated multiple times in a document. On the other hand, different instances of the same string in a single document can be connected with different keywords.
For example, this document
Bill studied The Bill of Rights last summer.
could be accompanied by the following keywords (with offsets in parentheses):
William Brown (0:4)
legal term (13:31)
summer 2011 (32:43)
(Obviously in other documents, Bill could refer to Bill Clinton or Bill Gates. Similarly, last summer will refer to different years in different documents. We do have all this information for all the documents.)
I know the document can have a field, say KEYWORD, which will store William Brown. Then when I search for William Brown I will get the above document. That part is easy.
But I have no idea how to store the info that William Brown corresponds to the text span 0:4 so I can highlight the first Bill, but not the second.
I thought I could use TermVectors, but I am not sure if/how I can store custom offsets. I would think this is a fairly common scenario ...
EDIT: edited to make clear that Bill can refer to different people/things in different documents.
EDIT2: edited to make clear that a document can contain homonyms (identical strings with different meanings).
Two Q Monte
Solution Pros:
Annotations logically stored with source docs
No knowledge of highlighter implementation or custom Java highlighter development required
Since all customization happens outside of Solr, this solution should be forward-compatible to future Solr versions.
Solution Cons:
Requires two queries to be run
Requires code in your search client to merge results from one query into the other.
With Solr 4.8+ you can nest child documents (annotations) underneath each primary document (text)...
curl 'http://localhost:8983/solr/update/json?softCommit=true' -H 'Content-type:application/json' -d '
[
{
"id": "123",
"text" : "Bill studied The Bill of Rights last summer.",
"content_type": "source",
"_childDocuments_": [
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
},
{
"id": "123-2",
"content_type": "source_annotation",
"annotation": "legal term",
"start_offset": 13,
"end_offset": 31
},
{
"id": "123-3",
"content_type": "source_annotation",
"annotation": "summer 2011",
"start_offset": 32,
"end_offset": 43
}
]
}
]'
... using block join to query the annotations.
1) Annotation Query: http://localhost:8983/solr/query?fl=id,start_offset,end_offset&q={!child of=content_type:source}annotation:"William Brown"
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
}
]
}
Store these results in your code so that you can fold in the annotation offsets after the next query returns.
2) Source Query + Highlighting: http://localhost:8983/solr/query?hl=true&hl.fl=text&fq=content_type:source&q=text:"William Brown" OR id:123
(id:123 discovered in Annotation Query gets ORed into second query)
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123",
"content_type": "source",
"text": "Bill studied The Bill of Rights last summer."
}
],
"highlighting":{}
}
Note: In this example there is no highlighting information returned because the search terms didn't match any content_type:source documents. However we have the explicit annotations and offsets from the first query!
Your client code then needs to take the content_type:source_annotation results from the first query and manually insert highlighting markers into the content_type:source results from the second query.
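The client-side merge could be as small as the sketch below. It assumes the offsets are character offsets into the stored text, that spans do not overlap, and that <em> tags are the desired markers; insert_markers is a hypothetical helper name.

```python
def insert_markers(text, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) span of text in highlight markers.

    Spans are applied from the end of the text backwards so that
    inserting markers does not shift the offsets still to be processed.
    """
    out = text
    for start, end in sorted(spans, reverse=True):
        out = out[:start] + pre + out[start:end] + post + out[end:]
    return out


text = "Bill studied The Bill of Rights last summer."
# start_offset / end_offset taken from the annotation query results
highlighted = insert_markers(text, [(0, 4)])
print(highlighted)
```

Only the first "Bill" is wrapped, because the span comes from the annotation rather than from matching the string.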
More block join info on Yonik's blog here.
By default, Solr stores the start/end position of each token once the text is tokenized, for instance using the StandardTokenizer. This information is encoded in the underlying index. The use case that you described here sounds a lot like the SynonymFilterFactory.
When you define a synonym using the SynonymFilterFactory, stating for instance that foo is equivalent to bar, the bar term is added to the token stream generated when the text is tokenized, and it will have the same offset information as the original token. So, for instance, if your text is "foo is awesome", the term foo will have the offset information (start=0, end=3), and a new token bar (start=0, end=3) will be added to your index (assuming that you're using the SynonymFilterFactory at index time):
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
Once the SynonymFilterFactory is applied:
       bar
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
So if you fire a query using foo, the document will match, but if you use bar as your query the document will also match since a bar token is added by the SynonymFilterFactory
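To make the offset bookkeeping concrete, here is a plain-Python imitation of the idea. It is not the Lucene or Solr API: it splits on whitespace, records character offsets with an exclusive end, and gives each injected synonym the offsets of the token it belongs to. Both function names are hypothetical.

```python
import re


def tokens_with_offsets(text):
    """Whitespace tokens as (term, start, end), end being exclusive."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def add_synonyms(tokens, synonyms):
    """Emit each synonym right after its token, sharing the same offsets."""
    out = []
    for term, start, end in tokens:
        out.append((term, start, end))
        for syn in synonyms.get(term, []):
            out.append((syn, start, end))
    return out


stream = add_synonyms(tokens_with_offsets("foo is awesome"), {"foo": ["bar"]})
for token in stream:
    print(token)
```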
In your particular case, you're trying to accomplish multi-term synonyms, which is a difficult problem; you may need something more than Solr's default synonym filter. Check this post from the guys at OpenSourceConnections and this other post from Lucidworks (the company behind Solr/Lucene). These two posts should provide additional information and the caveats of each approach.
Do you need to fetch the stored offsets for some later processing?
Suppose I want to create a recommendation system to suggest people you should connect with based off of certain attributes that I know about you and attributes I have about other people that are stored in a Solr index. Is it possible to query the index with a list of attributes (along with boosts for each attribute) and have Solr return scored results even if some of my fields return no matches? The way that I understand that Solr works is that if one of your fields doesn't contain a match in any documents found in your index, you get zero results for the entire query (even if other fields in the query matched) - is that right? What I would hope is that I could query the index and get a list of results back in order of a score given based on how many (and which) fields matched to something, even if some fields have no matches, for example:
Say that there are 2 people documents stored in the index as follows (figuratively):
Person 1:
Industry: Manufacturing
City: Oakland
Person 2:
Industry: Manufacturing
City: San Jose
And say that I perform a pseudo-Solr query that basically says "Search for everyone whose industry is equal to manufacturing and whose city is equal to Oakland". What I would like is to receive both results back in the result set, even though one of the "Persons" does not reside in Oakland. I just want that person to come back as a result with a lower score than Person1. Is this possible? What might a solr query look like to handle this? Assume that I have many more than 2 attributes for each person (so saying that I can use "And" and "Or" in my solr query isn't really feasible.. or is it?) Thanks in advance for your helpful input! (PS I'm using Solr 3.6)
You mention using the AND operator, which is likely your problem.
The default behavior of Lucene, and Solr, query syntax is exactly what you are asking for. A query like:
industry:manufacturing city:oakland
This will match either, with scoring preference for documents that match both. See the Lucene query syntax documentation.
You can also use the bq (boost query) parameter of the dismax/edismax parsers, which does not affect matching and only affects the scores.
http://localhost:8983/solr/persons/select?q=industry:manufacturing&bq=City:Oakland^2
Play with the boosting factor at the end to get the right balance between the matching score and the boosting score.
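A sketch of building such a request in client code, so that the colon and caret are URL-encoded properly. Adding defType=edismax is an assumption on my part (bq is read by the dismax/edismax parsers), and boosted_query_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def boosted_query_url(base, q, bq):
    """Build a select URL whose bq only influences scoring, not matching."""
    params = {"defType": "edismax", "q": q, "bq": bq}
    return f"{base}?{urlencode(params)}"


url = boosted_query_url(
    "http://localhost:8983/solr/persons/select",
    "industry:manufacturing",
    "City:Oakland^2",
)
print(url)
# Round-trip check: decoding the query string gives the parameters back.
print(parse_qs(urlsplit(url).query))
```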
Is there a way to get the count of different facets in solr?
This is an example facet:
Camera: 20
Computer: 80
Monitor: 40
Laptop: 120
Tablet: 30
What I need to get here is "5", as in 5 different electronic items. Of course I can count the facet values myself, but some facets have many values, and fetching and counting them really slows things down.
You need to apply Solr Patch SOLR-2242 to get the Facet distinct count.
Solr 5.1 will include a "JSON Facet API" with a unique() function, for example "unique(state)".
http://yonik.com/solr-facet-functions/
Not an exact answer to this question, but if facet distinct count doesn't work for anyone, you can get that information using group.ngroups:
group = true
group.field = [field you are faceting on]
group.ngroups = true
The disadvantage of this approach is that if you want ungrouped results, you will have to run the query a second time.
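Of course the patch is recommended, but the cheesy way to do this for a simple facet, as you have here, is just to use the length of the facet divided by 2, because Solr's default JSON output returns each facet field as a flat list of alternating values and counts. For example, since you are faceting on type: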
facetCount = facet_counts.facet_fields.type.length/2
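The same calculation as a Python sketch, using the facet values from the question in the flat value/count layout. It assumes the response really contains every facet value (for instance because facet.limit was raised), otherwise the result is capped by the limit.

```python
def distinct_facet_values(flat_facet):
    """Number of distinct values in a flat [value, count, value, count, ...] list."""
    return len(flat_facet) // 2


# facet_counts.facet_fields.type as it appears in the default JSON output
flat_facet = ["Camera", 20, "Computer", 80, "Monitor", 40, "Laptop", 120, "Tablet", 30]

facet_count = distinct_facet_values(flat_facet)
print(facet_count)
```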
Use this GET request:
http://localhost:8983/solr/core_name/select?json.facet={x:"unique(electronic_items)"}&q=*:*&rows=0
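From client code, the braces and quotes in json.facet are best URL-encoded. The sketch below builds the request and reads the result; the response layout (a "facets" object holding the label "x") is my assumption about the JSON Facet API output, sample_response is made up, and both helper names are hypothetical.

```python
import json
from urllib.parse import urlencode


def unique_count_url(base, field, label="x"):
    """Build a rows=0 request that asks for unique(field) under the given label."""
    facet = json.dumps({label: f"unique({field})"})
    return f"{base}?{urlencode({'q': '*:*', 'rows': 0, 'json.facet': facet})}"


def read_unique_count(response, label="x"):
    """Pull the distinct count out of a parsed response (assumed layout)."""
    return response["facets"][label]


url = unique_count_url("http://localhost:8983/solr/core_name/select", "electronic_items")
print(url)

# Hypothetical parsed response for the facet values in the question.
sample_response = {"facets": {"count": 290, "x": 5}}
print(read_unique_count(sample_response))
```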