About stored fields vs docValues in Solr

Please help me understand the following regarding Solr:
1) Where are stored fields and docValues fields saved in Solr?
2) If we enable docValues for some fields, will normal query performance (search only, with no faceting or sorting applied) be better than with stored fields?
3) Is it advisable to replace all stored fields with docValues?

DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
DocValues fields are column-oriented, with a document-to-value mapping built at index time. This approach relieves some of the memory requirements of the fieldCache and makes lookups for faceting, sorting, and grouping much faster.
Stored fields keep all field values for one document together in a row-stride fashion. When a document is retrieved, all of its field values are returned at once, so loading the relevant information about a document is very fast.
However, if you need to scan a field (for faceting/sorting/grouping/highlighting), it is a slow process: you have to iterate through all the documents and load each document's fields on every iteration, resulting in many disk seeks.
Field values retrieved during search queries are typically returned from stored values. However, non-stored docValues fields will also be returned along with the stored fields when all fields (or pattern-matching globs) are requested (e.g. "fl=*"), depending on the effective value of the useDocValuesAsStored parameter for each field. For schema versions >= 1.6, the implicit default is useDocValuesAsStored="true".
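As an illustration (the field name here is made up), a field defined like this is not stored, yet will still appear in "fl=*" results because its values are read back from docValues:

```xml
<!-- Not stored, but still returned for fl=* because useDocValuesAsStored
     defaults to true for schema version >= 1.6 -->
<field name="popularity" type="pint" indexed="true" stored="false"
       docValues="true" useDocValuesAsStored="true"/>
```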
When retrieving fields from their docValues form (using the /export handler, streaming expressions or if the field is requested in the fl parameter), two important differences between regular stored fields and docValues fields must be understood:
Order is not preserved. For stored fields, the insertion order is the return order; for docValues, it is the sorted order.
Multiple identical entries are collapsed into a single value. Thus, if I insert the values 4, 5, 2, 4, 1, the return will be 1, 2, 4, 5.
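For instance, with a multiValued docValues field defined and populated as below (field names are illustrative), retrieving the field via fl would return the sorted, de-duplicated values 1, 2, 4, 5 rather than the submitted order 4, 5, 2, 4, 1:

```xml
<!-- schema: multiValued int field read back from docValues -->
<field name="ratings" type="pint" indexed="false" stored="false"
       docValues="true" multiValued="true"/>

<!-- update document: values supplied as 4, 5, 2, 4, 1 -->
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="ratings">4</field>
    <field name="ratings">5</field>
    <field name="ratings">2</field>
    <field name="ratings">4</field>
    <field name="ratings">1</field>
  </doc>
</add>
```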
In cases where the query returns only docValues fields, performance may improve, since returning stored fields requires disk reads and decompression, whereas returning docValues fields listed in fl only requires memory access.
In a low-memory environment, or when you don't need a field to be indexed, docValues are perfect for faceting/grouping/filtering/sorting/function queries.
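A sketch of such a field (the name is illustrative): it is never searched on or fetched from stored storage, and exists purely for faceting and sorting:

```xml
<!-- facet/sort only: no inverted index, no stored value, just docValues -->
<field name="manufacturer" type="string" indexed="false" stored="false"
       docValues="true"/>
```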
For more details, please refer to the DocValues documentation.

Related

Precedence of stored vs docValues in solr

If I have set both stored=true and docValues=true for a given field, as follows,
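(The original field definition is not shown; a definition along these lines, with both attributes enabled on the id field, would match the question:)

```xml
<field name="id" type="string" indexed="true" stored="true" docValues="true"/>
```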
what precedence does Solr choose? That is, if I do a sort/facet operation on the id field, will the value be retrieved from docValues, while a normal search query that returns the id field reads it from the stored field?
Please help me clarify this.
Stored values aren't used for anything other than returning the value to the client - i.e. the value that was submitted for the field when the document was sent to Solr. So yes, if you perform a search (or anything that results in a document being returned to the client), the stored value is read and returned.
docValues implements an actual feature in Lucene that makes certain operations more effective (such as sorting and faceting as you've mentioned). They're a lower level feature, and you don't really "see" the underlying stored docvalues.
I'm guessing you're confusing these values because of the "use docValues as stored value" support; in certain cases the value kept in a docValue will be the same as what the user submitted (for example, for integer fields), so if Lucene has already read and used the docValue - and it is compatible with the stored value - you can tell it to skip fetching the stored value as well, saving a read.

Solr Query Performance on Large Number Of Dynamic Fields

This question is a follow-up question to my previous question: Is child documents in solr anti-pattern?
I am creating a new question on dynamic field performance as I did not find any recent relevant posts on this topic and felt it deserved a separate question here.
I am aware that dynamic fields are treated like static fields, and performance-wise the two are similar.
Further, from what I have read, dynamic fields are not memory-efficient. Say one document has 100 fields and another has 1000 (the maximum number of fields in the collection); Apache Solr will allocate a memory block that supports all 1000 fields for every document in the collection.
I have a requirement where I have 6-7 fields that could be part of child documents and each parent document could have up to 300 child documents. Which means each parent document could have ~2000 fields.
What will be the performance impact on queries when we have such a large number of fields in the document?
That really depends on what you want to do with the fields and what their definitions are. With docValues, most of the earlier issues with memory usage for sparse fields (i.e. fields that only have values in a small fraction of the total number of documents) are solved.
Also, you can usually rewrite those dynamic fields into a single multiValued field for filtering, instead of filtering on each field (i.e. common_field:field_prefix_value, where common_field contains the values you want to filter on, prefixed with a field name / unique field id).
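A sketch of that pattern, with illustrative field names: instead of many per-attribute dynamic fields, one shared multiValued field holds "fieldname_value" tokens, and you filter on those tokens:

```xml
<!-- one shared filtering field instead of many dynamic fields -->
<field name="common_field" type="string" indexed="true" stored="false"
       docValues="true" multiValued="true"/>

<!-- document: each value encodes "<source field>_<value>" -->
<add>
  <doc>
    <field name="id">prod42</field>
    <field name="common_field">color_red</field>
    <field name="common_field">size_large</field>
  </doc>
</add>

<!-- then filter with fq=common_field:color_red rather than fq=color:red -->
```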
Anyway, in the end it depends on how many documents you have in total. If you only have 1000 documents, it won't be an issue in any way. If you have a million, it used to be - depending on what you needed those dynamic fields for. These days it really isn't an issue, and I'd start out with the naive, direct solution and see if that works properly for your use case. It's rather hard to say without knowing exactly what these fields will contain, what the use case for the fields is, what they'll be used for, and the query profile of your application.
Also consider using a "side car" index if necessary, i.e. a special index with duplicated data from your main index to solve certain queries or query requirements. You pick which index to search based on the use case, and then return the appropriate data to the user.

Indexing Architecture for frequently updated index solr?

I have roughly 50M documents, with 90 fields (20 stored + 70 non-stored) in schema.xml, indexed in a single core. The queries are quite complex, with faceting and highlighting. Of these 90 fields, 3-4 (all stored) are updated very frequently. Updating these fields normally would require re-populating all the fields again, which is a heavy task. If I use atomic/partial updates, we have to supply the non-stored fields again.
Our Solution:
To overcome the above problems, we decided to use SolrCloud and join queries. We split the index into two separate indexes/collections: one for the stored fields and one for the non-stored fields, the documents being related by the doc id. We kept the frequently updated fields in the stored collection, so we were able to leverage atomic updates. Also, to overcome the limitations of join queries in cloud mode, we sharded and replicated the stored fields across all nodes, while the non-stored collection was not sharded but was replicated across all nodes. We have a 5-node cluster with an additional 3 instances of ZooKeeper. Considering the number of docs, the only area of concern is: will join queries eventually degrade search performance? If so, what other options can I consider?
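A cross-collection join of that shape would look roughly like this (collection and field names are illustrative, not from the question); note that the fromIndex collection must have a replica available wherever the query runs, which is why the non-stored collection is replicated to every node:

```
q={!join from=id to=id fromIndex=nonstored_collection}body:search terms
&fq=stored_field:value
```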
Thinking in terms of joins makes Solr more like a relational database. I found an article on this from the Lucidworks team: Solr and Joins. Even they say that if your solution includes the use of joins, you need to rethink it.
I think I have a solution for you. First of all, forget two collections. Create one collection, and have two Solr documents for every single logical document: one document holds the stored fields and the other the non-stored fields. When updating, you update the document that has the stored fields, and you perform the search-related operations on the other document.
Now all you need to do at query time is merge both documents into a single document, which can be done by writing a service layer over Solr.
I have an issue with partial/atomic updates: they trigger index operations in the background on fields I did not modify. This differs from the question, but maybe the use of nested documents is worth thinking about.
I was checking whether nested documents could be used to separate document header data from the text content to be indexed, since processing the text content consumes a lot of resources. According to the docs, parents and children are indexed as blocks and always have to be indexed together.
This is stated in https://solr.apache.org/guide/8_0/indexing-nested-documents.html:
With the exception of in-place updates, the whole block must be updated or deleted together, not separately. For some applications this may result in tons of extra indexing and thus may be a deal-breaker.
So as long as you are not able to perform in-place updates (which have their own restrictions in terms of indexed, stored and <copyField...> directives), the use of nested documents does not seem to be a valid approach.
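For reference, in-place updates are only possible for single-valued, non-indexed, non-stored numeric docValues fields that are not involved in any copyField, e.g. (field name illustrative):

```xml
<!-- eligible for in-place updates: numeric docValues only, single-valued,
     not indexed, not stored, and not a copyField source or target -->
<field name="view_count" type="plong" indexed="false" stored="false"
       docValues="true"/>
```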

How does Solr filter on fields in documents

I am currently doing a POC on a collection with ~5000 fields per document. All the fields have the stored and indexed attributes set to true. I am interested in displaying only ~5 fields from each matching document.
I am trying to understand: does Solr bring all ~5k fields of a matched document into memory from the .fdt files and discard the rest, keeping only the 5 fields I am interested in?
My concern is with the memory usage if it keeps bringing all the fields in memory.
Any light in this regard will be much appreciated.
Thanks for your time.
In your managed-schema.xml or schema.xml, each field has a 'stored' attribute and an 'indexed' attribute. The indexed attribute allows the field to be found via full-text search. The stored attribute stores the data so that it can be displayed later on.
It sounds like, since you only want to display 5 fields and are worried about space, the solution would be to set those 5 fields to stored=true and the rest of your fields to stored=false, while keeping all fields set to indexed=true. This will minimize the memory that Solr uses while still allowing full-text search.
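Sketching that suggestion with hypothetical field names:

```xml
<!-- the ~5 display fields: searchable and returned to the client -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<!-- the remaining fields: searchable, but never fetched from the .fdt files -->
<dynamicField name="*_meta" type="text_general" indexed="true" stored="false"/>
```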

How are Solr's stored and indexed fields stored internally (in Lucene)

In Solr, when I set a field as 'indexed' but not 'stored', it is still kept in the index. If I go the other way around and set the field as 'stored' but not 'indexed', it is also kept in the index, if I understand correctly.
My question is: how is the document stored internally in Lucene in these cases? What do 'stored' fields look like in Lucene, and what do 'indexed' fields look like internally?
The answer to this question will perhaps help me understand why atomic updates in Solr only work with stored fields and not indexed fields (as explained here: https://wiki.apache.org/solr/Atomic_Updates#Stored_Values).
In Solr/Lucene, indexed and stored are two different concepts.
indexed means the field value will be saved in the inverted index, and you can search on it when you query. But you can't see it in the search result documents.
stored means it will be saved in the stored field values part, not in the inverted index; it cannot be searched, but it can be displayed when you get the search result documents.
Actually, the way Solr does an update is: it takes out the whole document (only the stored fields), changes the value you want to update, and saves it back (with re-indexing). That's why atomic updates can only support stored fields.
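As an illustration (field names are hypothetical), an atomic update like the following causes Solr to read the stored fields of document 1, apply the set, and re-index the whole document:

```xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="price" update="set">9.99</field>
  </doc>
</add>
```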
