Is there a way to sort documents in Solr by the number of fields in each document?
The Solr core in question has about 200 different fields, but not every field has to be present in every document. To single out datasets that contain too few fields to be correct, I'd like to work through a *:* query sorted from the lowest number of fields per document upwards.
I didn't find anything on this specific use case. Most results I found were about the relevance of individual fields, which might not help here given the large number of fields in the core.
It might be possible by sorting on a function query. That function would return a value that would be higher the more fields the doc has. But I am afraid that function would be huge (and slow), as it would need to enumerate all fields in the function.
By far the easiest thing would be to, at index time, add a 'nbFields' field containing the number of fields. Then you can sort easily on that one.
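A minimal sketch of that index-time approach, assuming documents are built as plain Python dicts before being sent to Solr. The helper name add_field_count is hypothetical, and nbFields would need to be defined in the schema as a sortable numeric field:

```python
def add_field_count(doc, count_field="nbFields"):
    """Return a copy of doc with count_field set to the number of populated fields."""
    populated = [
        name
        for name, value in doc.items()
        # Skip the counter itself and treat None, "" and [] as "not present".
        if name != count_field and value not in (None, "", [])
    ]
    enriched = dict(doc)
    enriched[count_field] = len(populated)
    return enriched


doc = {"id": "42", "title": "Example", "author": None, "tags": []}
print(add_field_count(doc)["nbFields"])  # 2 (only id and title are populated)
```

With that field in place, a query such as q=*:*&sort=nbFields asc should return the sparsest documents first.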
This question is a follow-up question to my previous question: Is child documents in solr anti-pattern?
I am creating a new question on dynamic field performance as I did not find any recent relevant posts on this topic and felt it deserved a separate question here.
I am aware that dynamic fields are treated as static fields and performance-wise both are similar.
Further, from what I have read, dynamic fields are not memory-efficient. Say one document has 100 fields and another has 1000 (the maximum number of fields in the collection): Apache Solr will allocate a memory block to support all 1000 fields for every document in the collection.
I have a requirement where I have 6-7 fields that could be part of child documents and each parent document could have up to 300 child documents. Which means each parent document could have ~2000 fields.
What will be the performance impact on queries when we have such a large number of fields in the document?
That really depends on what you want to do with the fields and how they are defined. With docValues, most earlier issues with memory usage for sparse fields (i.e. fields that only have values in a small fraction of the total number of documents) are solved.
Also, you can usually rewrite those dynamic fields to a single multiValued field for filtering instead of filtering on each field (i.e. common_field:field_prefix_value where common_field contains the values you want to filter on prefixed with a field name / unique field id).
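As a rough sketch of that rewrite, assuming documents are plain dicts at index time: the helper fold_dynamic_fields, the attr_ prefix and the fieldname_value token format are illustrative choices, not anything Solr prescribes.

```python
def fold_dynamic_fields(doc, prefixes, common_field="common_field"):
    """Move fields whose names start with one of prefixes into one multiValued field."""
    folded = {}
    values = []
    for name, value in doc.items():
        if name.startswith(tuple(prefixes)):
            # Normalise single values to a list, then prefix each value with the field name.
            items = value if isinstance(value, list) else [value]
            values.extend(f"{name}_{item}" for item in items)
        else:
            folded[name] = value
    folded[common_field] = values
    return folded


doc = {"id": "1", "attr_color": "red", "attr_size": ["m", "l"]}
print(fold_dynamic_fields(doc, ["attr_"]))
# {'id': '1', 'common_field': ['attr_color_red', 'attr_size_m', 'attr_size_l']}
```

Filtering then becomes fq=common_field:attr_color_red instead of a filter on a separate attr_color field.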
Finally, it'll depend on how many documents you have in total. If you only have 1000 documents, it won't be an issue in any way. If you have a million, it used to be - depending on what you needed those dynamic fields for. These days it really isn't an issue, and I'd start out with the naive, direct solution and see if that works properly for your use case. It's rather hard to say without knowing exactly what these fields will contain, what the use case for the fields is, and what the query profile of your application looks like.
Also consider using a "side car" index if necessary, i.e. a special index with duplicated data from your main index to solve certain queries or query requirements. You pick which index to search based on the use case, and then return the appropriate data to the user.
We are indexing parties in our project which have names, alternate names, different identifiers, addresses and so on. And we would like to have STRICT exact search functionality using single/double inverted commas besides usual searching functionality (without inverted commas).
In order to achieve that we configured two different search handlers and switch between them based on the presence of inverted commas in the user input. We also indexed each of the party attributes mentioned above twice, once with KeywordTokenizerFactory (for STRICT exact match search) and once with StandardTokenizerFactory (for usual search).
But the problem is that we doubled the number of fields in the Solr index and, naturally, its size.
So the question: is it possible to implement both types of searching with only one field in the Solr index per party attribute?
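For readers following along, the handler switching described in the question could look roughly like this. The handler paths /exact and /select are hypothetical names, and the rule "the whole input is wrapped in matching quotes" is an assumption about what counts as a strict search:

```python
def pick_handler(user_input, exact_handler="/exact", default_handler="/select"):
    """Return (handler, query_text) depending on whether the input is fully quoted."""
    text = user_input.strip()
    quoted = len(text) >= 2 and text[0] == text[-1] and text[0] in ("'", '"')
    if quoted:
        # Strip the surrounding quotes; the exact handler searches the keyword-tokenized fields.
        return exact_handler, text[1:-1]
    return default_handler, text


print(pick_handler('"Acme Ltd"'))  # ('/exact', 'Acme Ltd')
print(pick_handler("Acme Ltd"))    # ('/select', 'Acme Ltd')
```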
If you had implemented the same functionality using a single field, you'd still have more or less the same amount of data in the index. The tokens you're searching against still have to be present and stored somewhere, and you'd end up with a confusing situation where it'd be very hard to score and rank hits in the different "types" contained in the same field (which, for all purposes, would be two fields, just with the same name.. so .. it's two fields..)
Using two fields as you currently are is the way to do this. But remember, you don't have to store content for all the fields (use stored="false" for fields that have identical values to other fields). The stored value would be identical for both/all fields, so just display the value from the first field, but search against them both / just the first / just the second.
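One way to picture the stored="false" suggestion is as a Schema API payload (the JSON body you would POST to the core's /schema endpoint). This is only a sketch: the field naming scheme and the type names text_exact and text_general are assumptions, so substitute whatever field types your schema actually defines.

```python
import json


def add_field_commands(attribute, exact_type="text_exact", standard_type="text_general"):
    """Build an add-field payload: one stored field for display, one search-only field."""
    return {
        "add-field": [
            # Standard-tokenized field: searchable and stored, used for display.
            {"name": f"{attribute}_text", "type": standard_type, "indexed": True, "stored": True},
            # Keyword-tokenized field: searchable only, since the stored value would be a duplicate.
            {"name": f"{attribute}_exact", "type": exact_type, "indexed": True, "stored": False},
        ]
    }


print(json.dumps(add_field_commands("name"), indent=2))
```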
Another option to reduce index size is to store only the id field and not store any other fields. Retrieve any other values from a primary data store by looking up the id from the hit afterwards.
There are also many options you can disable for specific fields - which may not be needed depending on how you're using the field, such as termVectors, etc.
In some databases, if you don't include sorting in the query, the database may return the same query results in a different order each time. So if you are paging by sending multiple queries with different start positions, you may get the same rows multiple times.
Is it the same with Solr?
If I'm iterating all documents by changing the start parameter do I need to include some sorting field?
Documents are by default returned in the order they're added to the index. If a document is updated, it's effectively deleted and re-added, so it appears at the end of the index. If you're actually searching (and not just using fq), the score will be the same through each page of the result set (and the result set is sorted by score). If the index is updated, the score might change (as you'd expect).
So no, pagination in Solr does not require sorting. If you change the index while paginating, the results will change - just as they would if you sorted on an arbitrary field and added values that land within the interval you're displaying.
To use the cursor support ("cursorMark" or deep paging), you'll have to have the uniqueKey of the collection in the sort (to make the sort deterministic for identical values), but that's not required for queries without a particular sort.
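A minimal sketch of iterating with cursorMark. It assumes a caller-supplied fetch_page function (hypothetical) that sends the given parameters to Solr and returns the parsed JSON response, and that the uniqueKey field is called id:

```python
def iterate_all(fetch_page, query="*:*", unique_key="id", rows=100):
    """Yield every matching document by following nextCursorMark until it stops changing."""
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {
            "q": query,
            "rows": rows,
            "sort": f"{unique_key} asc",  # the sort must include the uniqueKey
            "cursorMark": cursor,
        }
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            # Solr returns the same cursor mark once there are no more results.
            break
        cursor = next_cursor
```

Note that start is not used here; the cursor takes over the job of tracking the position in the result set.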
I need to navigate back and forth in a Solr result set ordered by score, viewing documents one by one. To visualise that: first a list of document titles is presented to the user, then he or she can click one of the titles to see more details, and then needs to be able to move to the next document in the original list without going back and clicking another title.
While documents are being viewed they get changed: a dynamic field is modified (or created if it does not exist yet) to mark that the document has already been viewed (this is used in another search).
The problem I face is that when the document is altered and re-indexed to keep those changes, sometimes (and not always, which is very disturbing) its place in the result set for the same query changes (in other words, its score changes, since that doesn't happen when browsing results sorted by one of the documents' fields). So, "Previous" / "Next" navigation doesn't work properly.
I'm not using any custom weighting or boosters on fields for score calculation. Also, that dynamic field changed during browsing doesn't participate in the query used to get the record set browsed.
So, the questions are: can the modification of the document's field not included in the query change its relevance score? And if it can, then how can I control that?
UPDATE
I did some tests and can add the following:
Document changes its place in the result set even if no field is amended - just requesting the document and re-indexing it without any changes to its fields makes it take another place next time the same query over the same index is executed.
That happens even if the result set is sorted explicitly ("first_name DESC"), so the score (which depends on the update date) is not involved. The document stays the same, the field the result set is sorted by is the same, yet its position changes.
Still have no idea how to avoid that.
In Solr, if your field is "indexed", it will have an effect on the relevancy ranking ("stored" fields show up in search results but are not necessarily searchable). If the fields in question aren't marked as indexed then you are good to go. Note that "indexed" and "stored" are not the same thing, hence your confusion about result lists changing even though not all fields are shown (a field can be "indexed" and not "stored" as well).
In this case I think you want your "viewed" field to be "stored" but not "indexed". If you really want to control the query, you can use copyField to copy the relevant results into a single searchable field. You can also boost terms or documents so that certain fields are "less important" to the search query.
If you want to see how the relevancy rankings are calculated, you can add "debugQuery=on" to the end of your Solr Query (see the Relevancy FAQ for more info).
However, all that being said, I would recommend you cache your search result query (at least for the first page for your results), since you will always have results changing (documents added, removed by other users, etc). Your best bet is to design a UI that anticipates this, or at least batches a user's query.
I've found a solution which doesn't eliminate the problem completely but makes it much less likely to happen.
So the problem happens when the documents are sorted by some field and there is a number of them with the same value in this field (e.g. result set is sorted by first name, and there are 100 entries for "John").
This is when the indexed time gets involved - apparently Solr uses it to sort the documents when their main sorting fields are identical. To make this case much less probable, you need to add more sorting fields, e.g. "first_name desc" should become "first_name desc, last_name desc, register_date asc".
Also, adding document's unique id as the last sorting field should remove the problem completely (the set of sorting fields will never be identical for any two documents in the index).
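A small sketch of that last point, assuming the uniqueKey field is named id. The helper with_tiebreaker is hypothetical, and its clause parsing is deliberately naive (it would not handle function sorts that contain commas):

```python
def with_tiebreaker(sort, unique_key="id", direction="asc"):
    """Append the uniqueKey as the final sort clause unless it is already present."""
    clauses = [clause.strip() for clause in sort.split(",") if clause.strip()]
    already_present = any(clause.split()[0] == unique_key for clause in clauses)
    if not already_present:
        clauses.append(f"{unique_key} {direction}")
    return ", ".join(clauses)


print(with_tiebreaker("first_name desc"))
# first_name desc, id asc
print(with_tiebreaker("first_name desc, last_name desc, register_date asc"))
# first_name desc, last_name desc, register_date asc, id asc
```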
I wonder if there is a way to limit the spellchecking to just a part of the index.
Example: I have an index containing different products used in different countries.
When a search is performed I limit the Solr query to return results for COUNTRY X only. However, the suggestions that are returned are not limited to COUNTRY X; instead I receive suggestions based on the whole index (since I only have one spellcheck index).
I believe you can create a separate dictionary for each country to solve this, but here is the twist: I sometimes run a query where I want results back from COUNTRY_X and COUNTRY_Y, and thus also suggestions limited to those 2 countries. That would in turn require a dictionary index of its own, which seems a little too complicated, and the number of dictionary indexes would be large.
I'd try splitting the index per country, i.e. one index for country X and another for country Y. You can easily do this with a multi-core setup. This way each index gets its own dictionary.
When you want to search on multiple countries at once you run a distributed query over the indexes. Distributed support for the spell checking component is only available in trunk as of this writing though.
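A rough sketch of building such a distributed request with Solr's shards parameter. The core URLs and the helper name are made up for illustration, and whether spellcheck suggestions are merged across shards depends on your Solr version, as noted above:

```python
def distributed_query_params(query, countries, core_urls):
    """Build request parameters that fan the query out to one core per country."""
    # shards is a comma-separated list of host:port/path entries, without the scheme.
    shards = ",".join(core_urls[country] for country in countries)
    return {"q": query, "shards": shards}


# Hypothetical per-country cores.
core_urls = {
    "X": "localhost:8983/solr/products_x",
    "Y": "localhost:8983/solr/products_y",
}
print(distributed_query_params("widget", ["X", "Y"], core_urls))
# {'q': 'widget', 'shards': 'localhost:8983/solr/products_x,localhost:8983/solr/products_y'}
```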