I want to use Solr as the search engine for my website, and I am trying to understand the difference between basic paging and deep paging with a cursor mark.
As far as I understand, if you use basic pagination and query page 1001 with 20 results per page, this is what happens:
Solr will collect the first 1000*20 matching results
then display the next 20 results, for page 1001
I guess the problem is when someone clicks next page: Solr will first collect 1001*20 results and only after that show the desired 20.
I haven't seen a proper example for deep paging with large numbers. Only with small numbers, so I am not sure about this. Can someone clarify it please?
Is the following example correct?
.../query?q=id:book*&sort=pubyear_i+desc,id+asc&fl=title_t,pubyear_i&rows=1&cursorMark=*
This gives me the "nextCursorMark" : "AoJcfCVib29rMg=="
Now that I have the nextCursorMark I can go and find my desired page.
Should I now go through the pages manually? Should I create a loop where I search for that particular page I want?
Or should I have the first query with 20000 rows, get the nextCursorMark and then use it with another query having only 20 rows?
I find it a bit strange to run some query with 20000 rows just to get the nextCursorMark. Is it the correct way to do it?
And what if, for example, you have 10 pages and the user wants to jump from page 1 to page 5? Will I need to go through each page manually to get there?
Edit:
I have read this: How to manage "paging" with Solr?
And this: https://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
Tried to find a working example but couldn't.
The cursorMark tells Solr where the next response should start. It plays the role of the start parameter in your first example: as you paginate through the results, each response's nextCursorMark tells you where the next page begins.
If you're just looking up "what is the first result on page 1001" once, the first version will work just fine. But if you're paginating through the results - where a user may or may not go to the next page - the point of using cursorMarks is that each node (or the single node in a single-node setup) knows which document was the last one shown, and can therefore return only rows documents from its current position. With the first version, each node would have to collect and return start + rows documents. So instead of trying to find out "which documents are the twenty after position 20000", you only need to answer "which documents are the next twenty after this sort key".
In addition, cursorMarks handle updates to the result set better: you avoid changes to the result set pushing documents that have already been shown back onto the next page you're displaying.
See the reference guide for complete examples and further descriptions.
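To answer the "20000 rows" part of the question: no, you never ask for more than rows documents per request. A cursor only moves forward, one page at a time; you send rows=20 and the cursor, and you use the returned nextCursorMark for the next request. The loop can be sketched like this (a minimal sketch; `fetch` stands in for one real HTTP request to Solr with cursorMark and rows parameters and a stable sort ending on the uniqueKey):

```python
def iterate_with_cursor(fetch, rows=20):
    """Walk a whole result set with cursorMark paging.

    `fetch(cursor_mark, rows)` stands in for one Solr request with
    cursorMark=<cursor_mark>&rows=<rows>; it must return a tuple of
    (docs, next_cursor_mark). Solr signals the end of the result set
    when the nextCursorMark it returns equals the cursorMark you sent.
    """
    cursor = "*"  # "*" means: start from the beginning
    while True:
        docs, next_cursor = fetch(cursor, rows)
        yield from docs
        if next_cursor == cursor:  # no more results
            break
        cursor = next_cursor
```

Note that this means a cursor cannot jump straight from page 1 to page 5; for random access to arbitrary pages you would still use start/rows.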
Related
I am trying to index and search a wiki on our intranet using Solr. I have it more-or-less working using edismax but I'm having trouble getting main topic pages to show up first in the search results. For example, suppose I have some URLs in the database:
http://whizbang.com/wiki/Foo/Bar
http://whizbang.com/wiki/Foo/Bar/One
http://whizbang.com/wiki/Foo/Bar/Two
http://whizbang.com/wiki/Foo/Bar/Two/Two_point_one
I would like to be able to search for "foo bar" and have the first link returned as the top result because it is the main page for that particular topic in the wiki. I've tried boosting the title and URL field in the search but the fieldNorm value for the document keeps affecting the scores such that sub-pages score higher. In one particular case, the main topic page shows up on the 2nd results page.
Is there a way to make the first URL score significantly higher than the sub-pages so that it shows up in the top-5 search results?
One possible approach to try:
Create a copyField with your url
Extract path only (so, no host, no wiki)
Split on / and maybe space
Lowercase
Boost on phrase or bigram or something similar.
If you have a lot of levels, maybe you want a multivalued field, with different depths (starting from the end) getting separate entries. That way a perfect match will score better. From here, you should start experimenting with your real searches.
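The steps above could be sketched as an analysis chain like this (a sketch only; the field and type names are made up, and the regex assumes your URLs look like the examples in the question):

```xml
<!-- hypothetical names: adjust "url", "url_path_txt" to your schema -->
<fieldType name="url_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- drop scheme, host and the /wiki/ prefix, keeping only the topic path -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^https?://[^/]+/wiki/" replacement=""/>
    <!-- split the remaining path on slashes and whitespace -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[/\s]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="url_path_txt" type="url_path" indexed="true" stored="false"/>
<copyField source="url" dest="url_path_txt"/>
```

With that in place you could boost phrase matches on the new field, e.g. via edismax's pf parameter (pf=url_path_txt^10), so that http://whizbang.com/wiki/Foo/Bar matches "foo bar" as a complete phrase while the sub-pages carry extra tokens.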
I'm quite stuck with searching for a solution for my problem and I hope that you can maybe help me.
In general I want to build a small job platform. It includes an "Explore"-Section, which is just like a Search-Page with Facets.
The actual job-nodes can be tagged with terms of the two vocabulary "skills" and "interests".
The facets on the search page allow the user to filter jobs exactly along these skills and interests.
However, I want to use the "OR" operator for the facets, so that the user gets a list of jobs that nearly perfectly match their skills & interests, but also jobs that match only some of those terms.
So, here you can see the default listing page. On the left are the Facets for interest and type (Operator "OR"). On the right, you can see the result set with title, and the node's skills & interest terms:
See the image of the Jobsearch Default page
Now, I'm applying "Musik" and "Kultur" as interest-filters:
See the image of the Jobsearch with applied filters
As you can see in the result-set, the OR-operator delivers all the results.
However, I would like to sort these results according to their "relevance", i.e. according to the count of matched criteria.
The 4th and 5th results match both terms selected in the facet, so they should be listed ahead of all the other results.
So, I hope you understand what I want to achieve. I started with Views to accomplish the goal, but then switched to search_api and Solr, as I think that approach will be more extensible in the future.
The second aim is that a user can store his/her individual interests & skills (the filters mentioned before) in his user profile. The user should then see individual job recommendations, based on that profile, on his account page.
So, any hints, tips, tricks, links are very welcome as I have no idea if I'm on the right track to solve my problem(s). :)
Robert
Maybe this approach could be an alternative:
Instead of using the tags as facets/filters, I could use them just as search input.
When I type my terms/tags into the search field of an apache-solr-search page, I get exactly the results I want, sorted by their relevance:
Searching the tags instead of filtering
So maybe I just have to write a small piece of code that automatically creates a search query based on the clicked terms/tags…
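That piece of code could be as small as mapping the selected tags onto an OR query, so that normal relevance scoring ranks documents matching more tags higher (a sketch; the field name tm_field_interest is a made-up example, not your actual search_api field):

```python
def tags_to_query(tags, field="tm_field_interest"):
    """Build an OR query from the clicked tags. Under normal OR scoring,
    documents matching more of the tags get a higher relevance score,
    which gives the "best matches first" ordering the question asks for.
    The field name is a hypothetical example."""
    return " OR ".join('{0}:"{1}"'.format(field, tag) for tag in tags)
```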
I want to implement click-through relevancy ranking in a search (solr). Basically depending on the users' feedback (which are clicks), we want to change the ordering of search results. Following is my approach.
We will add a new field to document to index the queries for which result/document has been accessed (or clicked). Whenever a result is clicked, we will update the index to include the query for which the result has been clicked. We will use solr's partial updates to add the new query to the index. Since, we use index as our data-store as well, all our fields are stored and I can afford to store one more field.
Is this the right approach to implement this feature?
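Concretely, the partial update I have in mind would look roughly like this (a sketch; clicked_queries_ss is a hypothetical multivalued string field, and the payload would be POSTed to the collection's /update handler):

```python
import json

def click_update_payload(doc_id, query, field="clicked_queries_ss"):
    """Build a Solr atomic-update payload that appends the search query
    to a multivalued field on the clicked document ("add" is Solr's
    atomic-update operation for appending to multivalued fields).
    The field name is a hypothetical example."""
    return json.dumps([{"id": doc_id, field: {"add": query}}])
```

Since all fields are stored, Solr can reconstruct the rest of the document during the atomic update.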
Note: I still have to evaluate logging, and I am some way from implementing it. I was just building a requirement specification to start with, which I formulated.
It is as follows.
Evaluate user selection (Click through) for `query` and matched result position.
The position is important because it determines the relevancy.
I chose the top results to be 3. (Assume N=3).
If users are selecting something at a position N>3, it is important to increase that result's boost for the query.
If the position is at N<=3, we're good.
If the position is consistently at N<=3, demote the top results (maybe?)
However, we may get a lot of wrong info here. Suppose a single user goes crazy and clicks absolutely irrelevant results.
So we need to monitor usage, and log user events as well, beyond just the basic position and click-through, to cover this.
So, logging needs to cover:
Clicked results per page per {user-login|session}.
Clicks on results for {Query + Filters + Facets}. A special flag for {did you mean... | autocomplete} click events, with {TimeStamp + Location}.
If a significant number of unique users indicate clicking on low-score documents during a time range (months), I would boost the documents according to location.
Since we have correlated a user session (login), I might even be able to map results to the user (and if the user generated irrelevant noise, send it back to him ;P).
However, I would try my best not to apply too much boost. The search may look tampered with.
A feedback form for users to fill in might also be a good idea, to see how well you are doing.
Imagine an index like the following:
id  partno    name          description
1   1000.001  Apple iPod    iPod by Apple
2   1000.123  Apple iPhone  The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain Doesn't Do This. If you absolutely need it, I'd try the multiple-requests solution and benchmark it -- Solr tends to be a lot faster than what people put in front of it, so a couple of extra requests might not be that big of a deal.
you could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You could also implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler's processing chain, so you will only need a single query from the frontend.
If you want the term to be searched over the same fields every time, you have 2 options that don't break the "single query" requirement:
1) copyField: you group at index time all the fields that should match together. With just one copyField your problem doesn't exist; if you need more than one, you're back at the same spot.
2) you could filter the query each time by dynamically adding the "fq" parameter at the end:
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
this works if you'll always be searching on the same two fields (or you can set them up before querying); otherwise you'll always need at least a second query.
Since I said "you have 2 options" but you actually have 3 (and I rushed my answer), here's the third:
3) the dismax plugin described by them like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
So, if you can use it, you may want to give it a look, starting from the qf parameter (that is what option number 2 was meant to be about, but I changed it in favor of fq... don't ask me why...).
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "facet": "true",
      "facet.query": ["title:donkey", "text:donkey"],
      "q": "*:*",
      "wt": "json",
      "rows": "0"
    }
  },
  "response": {"numFound": 3365840, "start": 0, "docs": []},
  "facet_counts": {
    "facet_queries": {
      "title:donkey": 127,
      "text:donkey": 4108
    },
    "facet_fields": {},
    "facet_dates": {},
    "facet_ranges": {}
  }
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+title&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
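On the client side, the "Filter by field" section can then be built from the facet_queries part of the response (a small sketch over the JSON structure shown above):

```python
def field_hit_counts(response):
    """Map each faceted field to its hit count for the "Filter by field"
    UI, from the facet_queries section (keys look like "title:donkey").
    Fields with zero hits are dropped, since they make useless filters."""
    queries = response["facet_counts"]["facet_queries"]
    return {q.split(":", 1)[0]: count for q, count in queries.items() if count > 0}
```

The keys of the returned mapping are the field names to offer as filters, and the values are the counts to display next to them.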
I am using LucidWorks and Solr to implement search in a large and diverse web app which has many different types of pages. The spec calls for a single search results page grouped by page type with pagination of search results in each group.
I can group easily enough with something like this
q=[searchterm]&group=true&group.field=[pagetypefield]
which returns nicely grouped results.
I can also do:
q=[searchterm]&group=true&group.field=[pagetypefield]&group.offset=[x]&group.limit=[y]
which will get me y results per group starting at result x
However, what I want to be able to do is supply an offset and limit per group, because I might want to get results 0-4 for group 1 and results 5-9 for group 2.
The values for [pagetypefield] are a list of known values, so I could do multiple queries like:
q=[searchterm]&group=true&group.query=[pagetypefield]:[value]&group.offset=[x]&group.limit=[y]
for each known value of [pagetypefield],
or not use group.offset at all and, in my example, get results 0-9 for both groups and just discard the results I don't need.
I don't really like either option, but I can't find a way in the documentation to specify offset and limit on a per-group basis.
Any advice would be most appreciated.
I have confirmation from LucidWorks that what I want to do is not possible, and they recommended the multiple-search solution, as the first search will be cached so subsequent searches will be really fast.
What I think I'm going to end up doing is grouping the search results taking the first n results for each group, then using ajax to paginate each group.
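For the multiple-search variant, each ajax pagination call boils down to one group.query request with its own offset and limit (a sketch; pagetype_s is a hypothetical stand-in for [pagetypefield]):

```python
from urllib.parse import urlencode

def group_page_params(searchterm, pagetype, offset, limit, field="pagetype_s"):
    """Build the query string for paging one page-type group independently;
    one such request is sent per group the user paginates.
    The field name is a hypothetical stand-in for [pagetypefield]."""
    return urlencode({
        "q": searchterm,
        "group": "true",
        "group.query": "{0}:{1}".format(field, pagetype),
        "group.offset": offset,
        "group.limit": limit,
    })
```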