I'm writing the CSV file used to train a ranker in the Watson Retrieve and Rank service, with many rows of the form [query, "id_doc", "relevance_score", ...].
I have two questions about the structure of this file:
1. I have to distinguish between two documents, depending on whether or not the query contains the word "not". More specifically:
the body and the title of the first document contain "manager"
the body and the title of the second document contain "not manager"
Thus, if the query is like "I'm a manager. How do I....?" then the first document is correct, but not the second one.
If the query is like "I'm not a manager..." then the second document is correct, but not the first one.
Is there any particular syntax that can be used to write the query properly, maybe using a boolean operator? Is this file the right place to apply this kind of filter?
2. This service also has a web interface to train a ranker. The rating scale used on that site is: 1 -> incorrect answer, 2 -> relevant to the topic but doesn't answer the question, 3 -> good, but can be improved, 4 -> perfect answer.
Is the relevance score used in this file the same as the one used in the web interface?
Thank you!
Is there any particular syntax that can be used to write the query properly, maybe using a boolean operator? Is this file the right place to apply this kind of filter?
As you hinted, this file is not quite the appropriate place for using filters. The training data will be used to figure out what types of lexical overlap features the ranker should pay attention to when trying to optimize the ordering of the search results from Solr (see discussion here for more information: watson retrieve-and-rank - manual ranking).
That said, you can certainly add at least two rows to your training data like so:
The first can have the question text "I'm a manager. How do I do something" along with the corresponding correct doc id and a positive integer relevance label.
The second row can have the question text "I'm not a manager. How do I do something" along with the answering doc id for non-managers and a positive integer relevance label.
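As a minimal sketch, assuming the [query, doc_id, relevance] row layout from the question (the doc IDs and the label value here are just placeholders), those two rows could look like:

"I'm a manager. How do I do something","doc_for_managers","3"
"I'm not a manager. How do I do something","doc_for_non_managers","3"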
With a sufficient number of such examples, hopefully the ranker will learn to pay attention to bigram lexical overlap features. If that doesn't work, you can certainly play with pre-detecting manager vs. not manager and applying appropriate filters, but I believe that's done with a separate parameter (fq?)... so you might have to modify train.py to pass the filter query appropriately (the default train.py takes the full query and passes it via the q parameter to the /fcselect endpoint).
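If you do go down that route, a rough client-side sketch in Python might look like the following. This is not the stock train.py; the base URL, cluster ID, collection, ranker ID and credentials are placeholders, and it assumes /fcselect passes fq through to Solr like a normal query parameter:

import requests

# Placeholders -- substitute your own service credentials and Solr cluster details.
BASE = "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1"
CLUSTER_ID = "your_cluster_id"
COLLECTION = "your_collection"
RANKER_ID = "your_ranker_id"
AUTH = ("username", "password")

def search(question):
    params = {"q": question, "ranker_id": RANKER_ID, "wt": "json", "fl": "id,title"}
    # Pre-detect the "not a manager" case and filter accordingly
    # (assumes the fq parameter is honored by /fcselect).
    if "not a manager" in question.lower():
        params["fq"] = 'body:"not manager"'
    else:
        params["fq"] = '-body:"not manager"'
    url = "%s/solr_clusters/%s/solr/%s/fcselect" % (BASE, CLUSTER_ID, COLLECTION)
    return requests.get(url, params=params, auth=AUTH).json()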
Is the relevance score used in this file the same as the one used in the web interface?
Not quite. The web interface uses the 1-4 star rating to improve the UI for data collection, but then compresses the star ratings to a smaller relevance-label scale when generating the training data for the ranker. I think the compression gives bad answers (i.e. star ratings < 3) a relevance label of 0 and passes the higher star ratings through as-is, so that there are effectively 3 levels of rating (though maybe someone on the UI team can clarify the details if need be). It is important for the underlying ranking algorithm that bad answers receive a relevance label of 0.
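If that description is accurate (and it is a guess at the details), the compression would amount to something like this small helper:

def star_to_relevance(stars):
    """Map a 1-4 star rating from the web UI to a ranker relevance label.

    Assumption based on the description above: ratings below 3 are treated
    as bad answers (label 0); 3 and 4 are passed through unchanged.
    """
    return 0 if stars < 3 else stars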
Related
I've worked with Solr some in the past, but mostly the searching has been straightforward. We've now got a situation where we'd like searches that can restrict results with an "AND", using the following example:
Doc 1 --> StudentID:123 ClassID:001
Doc 2 --> StudentID:123 ClassID:002
Doc 3 --> StudentID:987 ClassID:001
The "English" version of the desired query would be "Give me all students in classes with classID:001 and ClassID:002. This would only return StudentID:123 and leave out Student:987.
Granted, our actual query is much more complex than this because the class could also have other properties like time, day, etc., but I wanted to see if I could get some help accomplishing the basic "AND" filtering first.
This is how we are currently implementing it and it "seems" to work, but since the number of classes can be dynamic, it means we'll need to dynamically update the mincount. Just curious if there's a "better" way of doing it.
q=*:*&fq=(ClassID:001)OR(ClassID:002)&rows=0&group=true&group.field=ClassID&group.facet=true&group.ngroups=true&facet=true&facet.field=ClassID&facet.mincount=2&facet.field=StudentID
I'm sure there's a straight forward way that I haven't found yet, so I'm handing the question off to the experts. Help is appreciated!
You could set your default operator in your schema.xml to OR. This assumes that all (or most) of the cases would want OR querying.
Then you could change your query to be something like this:
classId:('001' '002')
Since your class IDs are dynamic, you could inject this value by joining a list of class IDs on the client side.
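For example, a small Python sketch (using the classId field and the ID values from your example) that joins a dynamic list of class IDs into that query string:

def build_class_query(class_ids):
    # Produces e.g. classId:('001' '002'); with the default operator set to OR
    # in schema.xml, this matches documents having any of the listed class IDs.
    return "classId:(%s)" % " ".join("'%s'" % cid for cid in class_ids)

print(build_class_query(["001", "002"]))  # classId:('001' '002')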
I'm quite stuck searching for a solution to my problem and I hope you can help me.
In general, I want to build a small job platform. It includes an "Explore" section, which is essentially a search page with facets.
The actual job nodes can be tagged with terms from the two vocabularies "skills" and "interests".
The facets on the search page allow the user to filter jobs by exactly these skills and interests.
However, I want to use the "OR" operator for the facets, so that the user gets a list of jobs that match their skills & interests nearly perfectly, but also jobs that match only some of these terms.
So, here you can see the default listing page. On the left are the facets for interest and type (operator "OR"). On the right, you can see the result set with the title and each node's skills & interest terms:
See the image of the Jobsearch Default page
Now, I'm applying "Musik" and "Kultur" as interest-filters:
See the image of the Jobsearch with applied filters
As you can see in the result set, the OR operator delivers all the results.
However, I would like to sort these results according to their "relevance", i.e. according to the number of matched criteria.
The 4th and 5th results match both terms selected in the facet, so they should be listed before all the other results.
So, I hope you understand what I want to achieve. I started with Views to accomplish the goal, but then switched to search_api and Solr, as I think that approach will be more extensible in the future.
The second aim is that a user can store their individual interests & skills (the filters mentioned before) in their user profile. The user should then see individual job recommendations based on that profile on their account page.
So, any hints, tips, tricks, links are very welcome as I have no idea if I'm on the right track to solve my problem(s). :)
Robert
Maybe this approach could be an alternative:
Instead of using the tags as facets/filters, I could use them just as search input.
When I type my terms/tags into the search field of an Apache Solr search page, I get exactly the results I want, sorted by their relevance:
Searching the tags instead of filtering
So, maybe I just have to write a small piece of code that automatically creates a search query based on the clicked terms/tags…
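A rough sketch of that idea in Python, assuming a plain Solr select handler at a placeholder URL and a placeholder tags field (with search_api you would build the equivalent query through its PHP API instead):

import requests

SOLR_SELECT = "http://localhost:8983/solr/jobs/select"  # placeholder core/URL

def search_by_tags(tags):
    # OR the clicked tags together as query terms instead of filters, so Solr's
    # relevance scoring ranks jobs that match more tags higher.
    query = " OR ".join('tags:"%s"' % t for t in tags)
    params = {"q": query, "wt": "json", "rows": 20}
    return requests.get(SOLR_SELECT, params=params).json()

results = search_by_tags(["Musik", "Kultur"])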
Imagine an index like the following:
id | partno   | name         | description
1  | 1000.001 | Apple iPod   | iPod by Apple
2  | 1000.123 | Apple iPhone | The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain Doesn't Do This. If you absolutely need it, I'd try the multiple-requests solution and benchmark it -- Solr tends to be a lot faster than what people put in front of it, so a couple of extra requests might not be that big of a deal.
You could achieve this with two different search requests/queries (sketched below):
name:apple -> 2 hits
description:apple -> 1 hit
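A rough client-side sketch of that two-request approach, assuming a plain Solr select handler at a placeholder URL:

import requests

SOLR_SELECT = "http://localhost:8983/solr/products/select"  # placeholder URL

def hits_per_field(term, fields):
    # One request per field, only fetching the hit count (rows=0).
    counts = {}
    for field in fields:
        params = {"q": "%s:%s" % (field, term), "rows": 0, "wt": "json"}
        resp = requests.get(SOLR_SELECT, params=params).json()
        counts[field] = resp["response"]["numFound"]
    return counts

print(hits_per_field("apple", ["name", "description"]))  # e.g. {'name': 2, 'description': 1}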
EDIT:
You could also implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain, so you only need a single request from the frontend.
If you want the term to be searched over the same fields every time, you have 2 options that don't break the "single query" requirement:
1) copyField: at index time you group together all the fields that should match. With just one copyField your problem doesn't exist; if you need more than one, you're back at the same spot.
2) You could filter the query each time by dynamically adding the "fq" parameter at the end:
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
This works if you'll always be searching on the same two fields (or you can set them up before querying); otherwise you'll always need at least a second query.
Since I said "you have 2 options" but you actually have 3 (I rushed my answer), here's the third:
3) The DisMax plugin, which is described like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
So, if you can use it, you may want to give it a look and start from the qf parameter (that is what option number 2 was originally going to be about, but I changed it in favor of fq... don't ask me why...).
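For example, a single DisMax request that searches both fields with different weights might look like this (the boost value is arbitrary):

/select?q=Apple&defType=dismax&qf=name^2+description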
Solr faceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "facet": "true",
      "facet.query": ["title:donkey", "text:donkey"],
      "q": "*:*",
      "wt": "json",
      "rows": "0"
    }
  },
  "response": {"numFound": 3365840, "start": 0, "docs": []},
  "facet_counts": {
    "facet_queries": {
      "title:donkey": 127,
      "text:donkey": 4108
    },
    "facet_fields": {},
    "facet_dates": {},
    "facet_ranges": {}
  }
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+title&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
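Putting it together, a small client sketch in Python (placeholder Solr URL; the fields are assumed to be title and text, as in the facet queries above) that issues that request and turns facet_queries into the data for the "Filter by field" section:

import requests

SOLR_SELECT = "http://localhost:8983/solr/collection1/select"  # placeholder URL

def search_with_field_counts(term, fields):
    params = {
        "q": term,
        "defType": "edismax",
        "qf": " ".join(fields),
        "rows": 10,
        "wt": "json",
        "facet": "true",
        "facet.query": ["%s:%s" % (f, term) for f in fields],  # one facet.query per field
    }
    resp = requests.get(SOLR_SELECT, params=params).json()
    docs = resp["response"]["docs"]
    # facet_queries maps "field:term" -> hit count; strip the term to get per-field counts.
    field_counts = {q.split(":", 1)[0]: n
                    for q, n in resp["facet_counts"]["facet_queries"].items()}
    return docs, field_counts

docs, counts = search_with_field_counts("donkey", ["title", "text"])
print(counts)  # e.g. {'title': 127, 'text': 4108}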
Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they'd have more weight than the rest of words.
For specific documents you can use the QueryElevationComponent, which uses a special XML file in which you list the terms for which you want specific doc IDs elevated.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath: you control the final query. Or, in the worst case, you can modify it after you receive it on the Solr server side.
Similar: Lucene term boosting with sunspot-rails
When you build the query you can define the values and how much weight these fields carry in the search.
This can be done in many ways:
Setting the boost
The boost can be set by appending "^" and a weight to a term or field, e.g. printer^2 (see the example after this list).
Using plus operator
If you add the + operator to a term in your query, only documents that match that term are included in the results.
For a better understanding of Solr, it is best to get familiar with the Lucene query syntax. Refer to this link for more info.
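As a quick illustration combining both points against the example question (the boost value 4 is arbitrary):

q=+printer^4 +paper^4 morning ran out

Here printer and paper are required (+) and weighted more heavily (^4) than the remaining words.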
Let's say I have a Photo class containing a multi-valued tags property and a date field.
I would like to allow the user to perform a query based on tags (using only an AND operator for more than one tag).
For example, let's say a user searches for a rainy day:
Select * from Photo where tag='clouds' AND tag='rainy'
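For reference, a minimal sketch of that query with the Python NDB client on App Engine (the Photo model definition here is assumed, with tags as a repeated property); this is the shape of query that a zig-zag merge join serves:

from google.appengine.ext import ndb

class Photo(ndb.Model):
    tags = ndb.StringProperty(repeated=True)  # multi-valued property
    date = ndb.DateTimeProperty(auto_now_add=True)

# Two equality filters on the same repeated property are ANDed together and
# can be served by a zig-zag merge join over the built-in index on `tags`.
rainy_day_photos = Photo.query(Photo.tags == 'clouds', Photo.tags == 'rainy').fetch(20)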
How does the zig-zag merge work? I know that two scans are performed, and if the keys from both scans point to the same Photo, it is returned. Does this happen in parallel, though? For example: while Scan 1 finds a photo that contains tag 'clouds', Scan 2 finds the first photo that contains tag 'rainy'. When both scans have a candidate, do they synchronize? Scan 1 then continues its scan until it hits the same key as Scan 2. Then, while the keys for each scan are the same, the photo is returned and the "cursor" is moved along one step for each scan?
Secondly, does defining multiple indexes speed up this sort of query? For example, if I wanted to allow up to 4 tags, would I need to define indexes such as:
Index(Photo)
Index(Photo, tag)
Index(Photo, tag,tag)
Index(Photo, tag,tag,tag)
Index(Photo, tag,tag,tag,tag)
Would performing the same query above then be quicker?
Also, using our original query, let's say we have millions of photos tagged as cloudy, but only two are tagged as rainy. Does this mean zig-zag will perform relatively slowly, since one of the scans will struggle to find a matching key? Even worse, what if we have one million photos tagged "rainy" and one million tagged "cloudy", yet no single photo has both tags? Will defining the above indexes fix this issue?
Lastly, let's say a photo has 100 tags. Does that mean the indexes above have to include EVERY combination of the 100 tags?
I know there are gotchas (such as an entity can only be indexed 5000 times, and a single multi-valued property can only be indexed 1000 times).
How does the zig-zag merge work?
You can check out the Google I/O video from 2009 on Building Scalable, Complex Apps on App Engine. Brett Slatkin explains how zig-zag merge works starting at 27 minutes. As he says, "I can't really explain it without showing how it works."