Solr Query Performance on Large Number Of Dynamic Fields

This question is a follow-up to my previous question: Is child documents in solr anti-pattern?
I am creating a new question on dynamic field performance as I did not find any recent relevant posts on this topic and felt it deserved a separate question here.
I am aware that dynamic fields are treated like static fields, and that the two perform similarly.
Further, from what I have read, dynamic fields are not memory-efficient. Say one document has 100 fields and another has 1000 (the maximum number of fields in the collection); Apache Solr will allocate a memory block large enough for all 1000 fields for every document in the collection.
I have a requirement where 6-7 fields could be part of child documents, and each parent document could have up to 300 child documents. This means each parent document could have ~2000 fields.
What will be the performance impact on queries when we have such a large number of fields in the document?

That really depends on what you want to do with the fields and what their definitions are. With docValues, most earlier issues with memory usage for sparse fields (i.e. fields that only have values in a small fraction of the total number of documents) are solved.
Also, you can usually rewrite those dynamic fields as a single multiValued field for filtering, instead of filtering on each field separately (i.e. common_field:field_prefix_value, where common_field contains the values you want to filter on, prefixed with a field name / unique field id).
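As a sketch of that pattern (field and value names here are made up for illustration), you'd define a single multiValued string field, index each attribute as a prefixed value, and filter on the combined field:

    <field name="common_field" type="string" indexed="true" stored="false"
           multiValued="true" docValues="true"/>

    { "id": "doc1", "common_field": ["color_red", "size_42", "brand_acme"] }

    fq=common_field:color_red    (instead of fq=color_s:red)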
Beyond that, it depends on how many documents you have in total. If you only have 1000 documents, it won't be an issue in any way. If you have a million, it used to be, depending on what you needed those dynamic fields for. These days it really isn't an issue, and I'd start out with the naive, direct solution and see if that works for your use case. It's rather hard to say without knowing exactly what these fields will contain, what the use case for them is, and the query profile of your application.
Also consider using a "side car" index if necessary, i.e. a special index with duplicated data from your main index to solve certain queries or query requirements. You pick which index to search based on the use case, and then return the appropriate data to the user.

Related

Indexing Architecture for frequently updated index solr?

I have roughly 50M documents with 90 fields (20 stored + 70 non-stored) in schema.xml, indexed in a single core. The queries are quite complex, with faceting and highlighting. Of these 90 fields, there are 3-4 fields (all stored) which are updated very frequently. Updating these fields normally would require populating all the fields again, which is a heavy task. If I use atomic/partial updates, we still have to supply the non-stored fields again.
Our Solution:
To overcome the above problems, we decided to use SolrCloud and join queries. We split the index into two separate indexes/collections, i.e. one for stored fields and one for non-stored fields, the relation between the documents being the doc id. We kept the frequently updated fields in the stored collection; by doing this we were able to leverage atomic updates. Also, to work around the limitation of join queries in cloud mode, the stored-fields collection was sharded and replicated across all nodes, while the non-stored collection was not sharded but was replicated across all nodes. We have a 5-node cluster with an additional 3 ZooKeeper instances. Considering the number of docs, the only area of concern is: will join queries eventually degrade search performance? If so, what other options can I consider?
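For reference, a join query against such a setup would look roughly like this (collection and field names are illustrative, not taken from the setup above). In SolrCloud the fromIndex collection must have a replica on every node serving the queried collection, which is exactly why the non-stored collection is replicated everywhere:

    /solr/stored_fields/select
        ?q={!join from=id to=id fromIndex=non_stored_fields}body_txt:solr
        &fl=id,title,price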
Thinking in joins makes Solr more like a relational database. I found an article on this from the Lucidworks team, Solr and Joins. Even they say that if your solution includes the use of joins, you need to rethink it.
I think I have a solution for you. First of all, forget the two collections. Create one collection, and keep two Solr documents for every single logical document: one document holds the stored fields and the other holds the non-stored fields. At update time you update the document that has the stored fields, and you run search operations against the other document.
All you then need to do at query time is merge both documents into one, which can be done by writing a service layer over Solr.
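A minimal sketch of that layout, with made-up ids and field names: both halves carry a shared key, the search runs against one half, and the service layer fetches and merges the other half.

    { "id": "42_search", "doc_id_s": "42", "doc_type_s": "search", "body_txt": "..." }
    { "id": "42_stored", "doc_id_s": "42", "doc_type_s": "stored", "price_f": 10.5, "stock_i": 3 }

    step 1: q=body_txt:solr&fq=doc_type_s:search&fl=doc_id_s
    step 2: q=*:*&fq=doc_type_s:stored&fq=doc_id_s:(42 OR 57)
    step 3: merge the two field maps per doc_id_s in the service layer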
I have an issue with partial/atomic updates triggering index operations in the background on fields I did not modify. This differs from the question, but maybe the use of nested documents is worth thinking about.
I was looking into nested documents to separate document header data from the text content to be indexed, since processing the text content consumes a lot of resources. According to the docs, parents and children are indexed as blocks and always have to be indexed together.
This is stated in https://solr.apache.org/guide/8_0/indexing-nested-documents.html:
With the exception of in-place updates, the whole block must be updated or deleted together, not separately. For some applications this may result in tons of extra indexing and thus may be a deal-breaker.
So as long as you are not able to perform in-place updates (which have their own restrictions in terms of indexed, stored and <copyField...> directives), the use of nested documents does not seem to be a valid approach.
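For reference, Solr only performs an update in-place when the target field is single-valued, non-indexed, non-stored and backed by docValues (and not the destination of a copyField). A sketch with illustrative field and core names:

    <field name="popularity_i" type="pint" indexed="false" stored="false"
           docValues="true"/>

    curl -X POST 'http://localhost:8983/solr/mycore/update?commit=true' \
         -H 'Content-Type: application/json' \
         -d '[{"id": "doc1", "popularity_i": {"inc": 1}}]'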

Solr: Sort documents by number of fields

Is there a way to sort documents in Solr by the number of fields in each document?
The Solr core in question has about 200 different fields, though not every field is present in every document. To single out datasets that contain too few fields to be correct, I'd like to work through a *:* query sorted from the lowest number of fields per document upwards.
I didn't find anything on this specific use case. Most results I found were about the relevance of individual fields; however, that might not help here given the large field spectrum of the core.
It might be possible by sorting on a function query. That function would return a value that is higher the more fields the doc has. But I am afraid the function would be huge (and slow), as it would need to enumerate every field.
By far the easiest thing would be to, at index time, add a 'nbFields' field containing the number of fields. Then you can sort easily on that one.
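A sketch of that approach, with illustrative field names: the indexing code counts the populated fields, stores the count, and queries sort on it.

    <field name="nbFields" type="pint" indexed="true" stored="true" docValues="true"/>

    { "id": "doc1", "title_s": "...", "author_s": "...", "nbFields": 2 }

    q=*:*&sort=nbFields asc

The function-query variant mentioned above would be something like sort=sum(if(exists(title_s),1,0),if(exists(author_s),1,0)) asc, with one if(exists(...)) term per field, which is why it becomes unwieldy with 200 fields.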

How to implement usual and exact match search based on the same fields in index?

We are indexing parties in our project, which have names, alternate names, different identifiers, addresses and so on. We would like to have STRICT exact-match search functionality, triggered by single/double inverted commas, alongside the usual search functionality (without inverted commas).
To achieve that, we configured two different search handlers and switch between them based on the presence of inverted commas in the user input. We also indexed each of the mentioned party attributes twice: once with KeywordTokenizerFactory (for STRICT exact-match search) and once with StandardTokenizerFactory (for the usual search).
But the problem is that we doubled the number of fields in the Solr index, and naturally its size.
So the question: is it possible to implement both types of search with one field per party attribute in the Solr index?
If you had implemented the same functionality using a single field, you'd still have more or less the same amount of data in the index. The tokens you're searching against still have to be present and stored somewhere, and you'd end up with a confusing situation where it would be very hard to score and rank hits of the different "types" contained in the same field (which, for all practical purposes, would be two fields that just happen to share a name... so... it's still two fields).
Using two fields as you currently do is the way to go. But remember, you don't have to store content for all the fields (use stored="false" for fields whose values are identical to another field's). Since the value is identical for both/all fields, just display the value from the first field, but search against both / just the first / just the second.
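Sketched as schema.xml snippets (type and field names are illustrative): one analyzed field and one exact-match field fed by a copyField, with storage disabled on the duplicate.

    <fieldType name="text_exact" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="name_exact" type="text_exact" indexed="true" stored="false"/>
    <copyField source="name" dest="name_exact"/>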
Another option to reduce index size is to store only the id field and nothing else, then retrieve any values from a primary data store by looking up the id of each hit afterwards.
There are also many options you can disable for specific fields, such as termVectors, which may not be needed depending on how you're using the field.

Is there a better way to represent provenance on a field level in Solr

I have documents in Solr whose field values come from different source systems. The reason I am doing this is that this document is what I want returned from the Solr search, including functionality like hit highlighting. As far as I know, if I use a join across multiple Solr documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the source system into the dynamic field name, with the first part of the name being an 8-character hash identifying the source system. This way the systems can share common field names after the unique source hash, and I can easily clear out all fields that start with a given source prefix if needed.
Does this sound like something I should be doing, or is there some other way that others have attempted?
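As an illustration of the naming scheme described above (the hashes and values are made up):

    {
      "id": "entity-1",
      "type": "company",
      "name": "Acme",
      "3fa9c2d1_address_s": "1 Main St",
      "3fa9c2d1_rating_i": 4,
      "b81d0e44_address_s": "Main Street 1"
    }

Here 3fa9c2d1 identifies system A and b81d0e44 system B; clearing system A's data would be an atomic update that sets every 3fa9c2d1_* field found on the stored document to null.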
In our experience, the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at reindexing time. Tracking field names and field removal tends to involve a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to keep up when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can be kept specific to the indexing process. Usually the subsystems will be performant enough (at least when doing batch retrieval based on which documents are being updated).

Limiting solr spellchecking based on attributes

I wonder if there is a way to limit the spellchecking to just a part of the index.
For example, I have an index containing different products used in different countries.
When a search is performed I limit the Solr query to return results only for COUNTRY X; however, the suggestions that come back are not limited to COUNTRY X. Instead I receive suggestions based on the whole index (since I only have one misspell index).
I believe you can create a separate dictionary for each country to solve this, but here is the twist: I sometimes run a query where I want results back from COUNTRY_X and COUNTRY_Y, and thus suggestions limited to those two countries. That would in turn require a dictionary index of its own, which seems a little too complicated, and the number of dictionary indexes would become large.
I'd try splitting the index per country, i.e. one index for country X and another for country Y. You can easily do this with a multi-core setup. This way each index gets its own dictionary.
When you want to search on multiple countries at once you run a distributed query over the indexes. Distributed support for the spell checking component is only available in trunk as of this writing though.
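On versions where it is supported, such a distributed spellcheck request would look roughly like this (host and core names are illustrative); shards.qt must point at a request handler that includes the spellcheck component:

    /solr/country_x/select?q=prodct
        &spellcheck=true&spellcheck.collate=true
        &shards=host:8983/solr/country_x,host:8983/solr/country_y
        &shards.qt=/select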
