I have a Solr index accessed using SolrNet, and I would like to retrieve the index (position) of a particular document in the results, without returning the whole result set.
In more detail... the query returns ~30,000 documents and is ordered by an integer field. The unique key field contains a Guid, and I would like to find where in the results a particular document is, based on the unique key, while only returning the first 10 results.
This index was originally implemented in plain old Lucene, and this task was achieved with two queries: one to get the Lucene doc id of the document I want to know about, then a second that returns the whole result set. I can then use the doc id to find where the document appears in the full result set, while only enumerating the documents for the first 10.
Is there a way to achieve what I'm after with Solr, without returning all 30000 results (even limiting this to the Guid only seems too slow)?
Thanks
I think you can do this with a range query: using your user's points as the lower bound, you can count the number of users at or above that value. You can do an explicit query or get that information using range faceting.
So, if you know that your user's points value is 10,000, you could run the query game:tetris AND points:[10000 TO *], and if the number of results is 375 you would know that your user is at rank 375 (ties aside).
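A minimal sketch of the counting query described above, in Python. It assumes the results are ordered by points descending; the field names (game, points) come from the example, while the helper name rank_query_params is made up for illustration. Setting rows=0 asks Solr for the hit count (numFound) only, so no documents are transferred.

```python
from urllib.parse import urlencode


def rank_query_params(filter_query, sort_field, value):
    """Build select parameters that count documents at or above `value`.

    The count returned as numFound is the 1-based rank of a document
    holding `value`, provided no other document shares that value.
    """
    return {
        "q": f"{filter_query} AND {sort_field}:[{value} TO *]",
        "rows": 0,  # only the hit count is needed, not the documents
        "wt": "json",
    }


params = rank_query_params("game:tetris", "points", 10000)
print("/select?" + urlencode(params))
```

If several documents can share the same value, the count only gives an upper bound on the rank; a secondary sort field would be needed to resolve the exact position.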
The only reliable way I can think of is building the ranking at index-time, i.e. have a "rank" integer field and populate it when you build the index. The downside of this is that every update requires rebuilding the whole index.
Lucene doc ids are not stable, so I wouldn't recommend using them for this, and Solr does not expose them anyway.
In some databases, if you don't include sorting in the query, the database may return the same query results in a different order each time. So if you are paging by sending multiple queries with different start positions, you may get the same rows multiple times.
Is it the same with Solr?
If I'm iterating all documents by changing the start parameter do I need to include some sorting field?
Documents are by default returned in the order they're added to the index. If a document is updated, it's effectively deleted and re-added, so it appears at the end of the index. If you're actually searching (and not just using fq), the score will be the same through each page of the result set (and the result set is sorted by score). If the index is updated, the score might change (as you'd expect).
So no, pagination in Solr does not require sorting. If you change the index while paginating, the results will change, just as they would if you sorted on an arbitrary field and added values that land within the interval you're displaying.
To use the cursor support ("cursorMark" or deep paging), you'll have to have the uniqueKey of the collection in the sort (to make the sort deterministic for identical values), but that's not required for queries without a particular sort.
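As a rough illustration of cursor-based iteration, here is a Python sketch. The loop follows the documented cursorMark protocol: start with "*", pass each response's nextCursorMark into the next request, and stop when the mark comes back unchanged. The fetch_page callable and the fake_solr stand-in are hypothetical; in real use fetch_page would issue an HTTP request to the select handler. The uniqueKey is assumed to be named id.

```python
def iterate_all(fetch_page, sort="score desc, id asc", rows=100):
    """Yield every document, following cursorMark until it stops changing."""
    cursor = "*"  # initial cursor value
    while True:
        response = fetch_page(
            {"q": "*:*", "sort": sort, "rows": rows, "cursorMark": cursor}
        )
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged mark: nothing left to read
            break
        cursor = next_cursor


def fake_solr(docs):
    """In-memory stand-in for a Solr core, for demonstration only."""
    def fetch_page(params):
        mark = params["cursorMark"]
        start = 0 if mark == "*" else int(mark)
        page = docs[start:start + params["rows"]]
        # Return the same mark when the page is empty, as Solr does.
        next_mark = str(start + len(page)) if page else mark
        return {"response": {"docs": page}, "nextCursorMark": next_mark}
    return fetch_page


docs = [{"id": i} for i in range(5)]
ids = [d["id"] for d in iterate_all(fake_solr(docs), rows=2)]
print(ids)
```

Note that the sort ends with the uniqueKey, which is what makes the ordering deterministic when the leading sort values are equal.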
I need to synchronize a Solr index with a database table. At any given time, the Solr index may need to have documents added or removed. The nature of the database prevents the Data Import Handler's Delta Import functionality from being able to detect changes.
My proposed solution was to retrieve a list of all primary keys of the database table and all unique keys of the Solr index (which contain the same integer value) and compare these lists. I would use SolrJ for this.
However, to get all Solr documents requires the infamous approach of hard-coding the maximum integer value as the result count limit. Using this approach seems to be frowned upon. Does my situation have cause to ignore this advice, or is there another approach?
You can execute two queries to list all keys from Solr in one batch: the first with rows=0, which gives you the number of hits, and the second with that number as the rows parameter. It's not a very optimal solution, but it works.
A second possibility is to store an update date in the Solr index and fetch only the documents changed since the last synchronisation.
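However the two key lists are fetched, the comparison step from the question reduces to a pair of set differences. A small Python sketch (the function name diff_keys is illustrative, and the keys are assumed to be comparable values such as integers):

```python
def diff_keys(db_keys, solr_keys):
    """Return (keys to add to Solr, keys to delete from Solr)."""
    db_keys, solr_keys = set(db_keys), set(solr_keys)
    to_add = sorted(db_keys - solr_keys)     # in the database, missing from the index
    to_delete = sorted(solr_keys - db_keys)  # in the index, gone from the database
    return to_add, to_delete


to_add, to_delete = diff_keys([1, 2, 3, 5], [2, 3, 4])
print(to_add, to_delete)
```

This only detects added and removed rows; rows whose content changed under the same key would still need something like the update date mentioned above.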
I need to navigate back and forth in a Solr result set ordered by score, viewing documents one by one. To visualise that: first a list of document titles is presented to the user, then he or she can click one of the titles to see more details, and then needs to be able to move to the next document in the original list without going back and clicking another title.
While being viewed, documents get changed: a dynamic field is modified (or created if it does not exist yet) to mark that the document has already been viewed (this is used in another search).
The problem I face is that when the document is altered and re-indexed to keep those changes, sometimes (and not always, which is very disturbing) its place in the result set for the same query changes (in other words, its score changes, as that doesn't happen when browsing results sorted by one of the documents' fields). So, "Previous" / "Next" navigation doesn't work properly.
I'm not using any custom weighting or boosters on fields for score calculation. Also, that dynamic field changed during browsing doesn't participate in the query used to get the record set browsed.
So, the questions are: can the modification of the document's field not included in the query change its relevance score? And if it can, then how can I control that?
UPDATE
I did some tests and can add the following:
Document changes its place in the result set even if no field is amended - just requesting the document and re-indexing it without any changes to its fields makes it take another place next time the same query over the same index is executed.
That happens even if the result set is sorted explicitly ("first_name DESC"), so score (which depends on the update date) is not involved. The document stays the same, the field the result set is sorted by is the same, yet its position changes.
Still have no idea how to avoid that.
In Solr, if your field is "indexed", it will have an effect on the relevancy ranking ("stored" fields show up in search results but are not necessarily searchable). If the fields in question aren't marked as indexed then you are good to go. Note that "indexed" and "stored" are not necessarily the same, hence your confusion about result lists changing even though not all fields are shown (a field can be "indexed" and not "stored" as well).
In this case I think you want your "viewed" field to be "stored" but not "indexed". If you really want to control the query, you can use copyField to copy the relevant results into a single searchable field. You can also boost terms or documents so that certain fields are "less important" to the search query.
If you want to see how the relevancy rankings are calculated, you can add "debugQuery=on" to the end of your Solr Query (see the Relevancy FAQ for more info).
However, all that being said, I would recommend you cache your search result query (at least for the first page for your results), since you will always have results changing (documents added, removed by other users, etc). Your best bet is to design a UI that anticipates this, or at least batches a user's query.
I've found the solution which doesn't eliminate the problem completely but makes it much less likely to happen.
So the problem happens when the documents are sorted by some field and there is a number of them with the same value in this field (e.g. result set is sorted by first name, and there are 100 entries for "John").
This is when the indexed time gets involved - apparently Solr uses it to sort the documents when their main sorting fields are identical. To make this case much less probable, you need to add more sorting fields, e.g. "first_name desc" should become "first_name desc, last_name desc, register_date asc".
Also, adding document's unique id as the last sorting field should remove the problem completely (the set of sorting fields will never be identical for any two documents in the index).
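To see why the unique id works as a final tie-breaker, here is a small in-memory analogy in Python. It is not Solr code: it assumes that documents with equal sort values fall back to their current order in the index, and that re-indexing moves a document to the end. The field names first_name and id are illustrative.

```python
def sort_ids(docs, with_tiebreaker):
    """Sort by first_name, optionally breaking ties on the unique id."""
    if with_tiebreaker:
        key = lambda d: (d["first_name"], d["id"])
    else:
        key = lambda d: d["first_name"]  # ties keep their index order
    return [d["id"] for d in sorted(docs, key=key)]


before = [
    {"id": "a", "first_name": "John"},
    {"id": "b", "first_name": "John"},
    {"id": "c", "first_name": "Ann"},
]
# Document "a" is re-indexed unchanged and ends up last in the index.
after = [before[1], before[2], before[0]]

print(sort_ids(before, False), sort_ids(after, False))  # the two Johns swap
print(sort_ids(before, True), sort_ids(after, True))    # order is stable
```

In Solr terms this corresponds to a sort such as "first_name desc, id asc", where id stands for whatever the schema's uniqueKey is.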
I use Lucene to index my documents and search. Actually I have 800k documents indexed in Lucene. Those documents have some fields:
Id: is a Numeric field to index the documents
Name: is a textual field to be stored and analyzed
Description: like name
Availability: is a numeric field to filter results. This field can be updated frequently, every day.
My question is: What's the better way to create a filter for availability?
1 - add this information to index and make a lucene filter.
With this approach I have to update the document (remove and add, because Lucene 3.0.2 does not have update support) every time the "availability" changes. What is the cost of reindexing?
2 - don't add this information to index, and filter the results with a DB select.
This approach will do a lot of selects, because I need select every id from database to check availability.
3 - Create a separate index with id and availability.
I don't know if it is a good solution, but I can create one index with static information and another with information that can be frequently updated. I think it is better than updating the whole document just because some fields were updated.
I would stay away from 2. If you can handle everything with the search in Lucene, instead of searching in Lucene plus the DB, do it. I deal with this case (Lucene search + DB search) in my project, but only because there is no way around it.
The cost of an update is internally:
delete the doc
insert new doc (with new field).
I would just try approach number 1 (as it is the simplest). If the performance is good enough, stick with it; if not, you might look for ways to optimize it or try 3.
Answer provided from lucene-groupmail:
How often is "frequently"? How many updates do you expect to do in
a day? And how quickly must those updates be reflected in the search
results?
800K documents isn't all that many. I'd go with the simple approach first
and monitor the results, #then# go to a more complex solution if you
see a problem arising. Just update (delete/add) the documents when
the value changes.
Well, the cost to reindex is just about what the cost to index it originally
is. The old version of the document is marked deleted and the new one
is added. It's essentially the same cost as to index a new document.
This leaves some gaps in your index, that is the deleted docs are still in
there, but the next optimize will compact them.
From which you may infer that optimizing is the expensive part. I'd do that,
say once daily (or even weekly).
HTH
Erick
I wonder if there is a way to limit the spellchecking to just a part of the index.
For example, I have an index containing different products used in different countries.
When a search is performed I limit the Solr query to return only the results for COUNTRY X; however, the suggestions that are returned are not limited to COUNTRY X. Instead I receive suggestions based on the whole index (since I only have one spellcheck index).
I believe you can create a separate dictionary for each country to solve this, but here is the twist: I sometimes do a query where I want results back from COUNTRY_X and COUNTRY_Y, and thus also suggestions limited to those two countries. That would in turn require a dictionary index of its own, which seems a little too complicated, and the number of dictionary indexes would be large.
I'd try splitting the index per country, i.e. one index for country X and another for country Y. You can easily do this with a multi-core setup. This way each index gets its own dictionary.
When you want to search on multiple countries at once you run a distributed query over the indexes. Distributed support for the spell checking component is only available in trunk as of this writing though.
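A sketch of what such a distributed request could look like, in Python. The shards parameter is Solr's standard way to list the cores a query should fan out to; the host, port, and core names below (country_x, country_y) are assumptions made up for illustration, as is the helper name shards_param.

```python
from urllib.parse import urlencode


def shards_param(base, cores):
    """Join per-country core addresses into a Solr `shards` value."""
    return ",".join(f"{base}/{core}" for core in cores)


params = {
    "q": "name:widget",
    # Hypothetical per-country cores on a single local Solr instance.
    "shards": shards_param("localhost:8983/solr", ["country_x", "country_y"]),
}
print("/select?" + urlencode(params))
```

Searching a single country then means querying that country's core directly, with no shards parameter at all.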