If I have set both stored=true and docValues=true for a given field as follows,
what takes precedence in Solr? That is, if I do a sort or facet operation on the id field, is the value retrieved from docValues, and if I execute a normal search query that returns the id field, is it returned from the stored field?
Please help me clarify this.
Stored values aren't used for anything other than returning the value to the client - i.e., the value that was submitted for the field when the document was sent to Solr. So yes, if you perform a search (or anything that results in a document being returned to the client), the stored value is read and returned.
docValues implement an actual feature in Lucene that makes certain operations more efficient (such as sorting and faceting, as you've mentioned). They're a lower-level feature, and you don't really "see" the underlying docValues.
I'm guessing you're confusing these values because there's support for "use docValues as the stored value": in certain cases the value stored in a docValue is the same as what the user submitted (for example, for integer fields), so if Lucene has already read and used the docValue - and it is compatible with the stored value - you can tell it to skip fetching the stored value as well, saving a read.
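A minimal schema sketch of that setup (field names and types here are illustrative, not from the question):

    <!-- schema.xml: "id" keeps its submitted value (stored="true") and a
         column-oriented copy for sorting/faceting (docValues="true") -->
    <field name="id" type="string" indexed="true" stored="true" docValues="true"/>

    <!-- For a type where the docValue holds the original value (e.g. numerics),
         Solr can serve the "stored" value from docValues and skip the stored read -->
    <field name="price" type="plong" indexed="true" stored="false"
           docValues="true" useDocValuesAsStored="true"/>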
Please help me understand the following regarding Solr:
1) Where are stored fields and docValues fields saved in Solr?
2) If we enable docValues for some fields, will normal query performance (search only, with no faceting or sort applied) be better compared to using stored fields?
3) Is it advisable to replace all the stored fields with docValues?
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
DocValues fields are column-oriented, with a document-to-value mapping built at index time. This approach relieves some of the memory requirements of the fieldCache and makes lookups for faceting, sorting, and grouping much faster.
Stored fields keep all field values for one document together in a row-stride fashion: when a document is retrieved, all of its field values are returned at once, so loading the relevant information about a document is very fast.
However, if you need to scan a field (for faceting/sorting/grouping/highlighting), it is a slow process, as you have to iterate through all the documents and load each document's fields per iteration, resulting in many disk seeks.
Field values retrieved during search queries are typically returned from stored values. However, non-stored docValues fields will also be returned along with other stored fields when all fields (or pattern-matching globs) are specified to be returned (e.g. "fl=*") for search queries, depending on the effective value of the useDocValuesAsStored parameter for each field. For schema versions >= 1.6, the implicit default is useDocValuesAsStored="true".
When retrieving fields from their docValues form (using the /export handler, streaming expressions or if the field is requested in the fl parameter), two important differences between regular stored fields and docValues fields must be understood:
Order is not preserved. For simply retrieving stored fields, the insertion order is the return order. For docValues, it is the sorted order.
Multiple identical entries are collapsed into a single value. Thus if I insert values 4, 5, 2, 4, 1, my return will be 1, 2, 4, 5.
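As a concrete sketch of both points (the field name and type are illustrative):

    <!-- multiValued docValues field, returned via useDocValuesAsStored -->
    <field name="nums" type="plong" indexed="true" stored="false"
           multiValued="true" docValues="true" useDocValuesAsStored="true"/>

    <!-- document submitted with values in this order: 4, 5, 2, 4, 1 -->
    <add><doc>
      <field name="id">1</field>
      <field name="nums">4</field>
      <field name="nums">5</field>
      <field name="nums">2</field>
      <field name="nums">4</field>
      <field name="nums">1</field>
    </doc></add>

    <!-- q=id:1&fl=nums then returns the sorted, de-duplicated form:
         "nums":[1,2,4,5] -->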
In cases where the query returns only docValues fields, performance may improve, since returning stored fields requires disk reads and decompression, whereas returning docValues fields in the fl list only requires memory access.
In a low-memory environment, or when you don't need to index a field, docValues are perfect for faceting/grouping/filtering/sorting/function queries.
For more details, please refer to the DocValues documentation.
I would like to use Solr atomic updates in combination with some stored copyField destination fields, which is not a recommended combination - so I wish to understand the risks.
The Solr documentation for Atomic Updates says (my emphasis):
The core functionality of atomically updating a document requires that
all fields in your schema must be configured as stored (stored="true")
or docValues (docValues="true") except for fields which are
<copyField/> destinations, which must be configured as stored="false".
Atomic updates are applied to the document represented by the existing
stored field values. All data in copyField destinations fields must
originate from ONLY copyField sources.
However, I have some copyField destinations that I would like to set stored=true so that highlighting works correctly for them (see this question, for example).
I need atomic updates so that an (unrelated) field can be modified by another process, without losing data indexed by my process.
The documentation warns that:
If destinations are configured as stored, then Solr will
attempt to index both the current value of the field as well as an
additional copy from any source fields. If such fields contain some
information that comes from the indexing program and some information
that comes from copyField, then the information which originally came
from the indexing program will be lost when an atomic update is made.
But what does that mean? Can someone give an example that demonstrates this information-loss problem?
I am unsure what is meant by "some information that comes from the indexing program and some information that comes from copyField", in concrete terms.
Is it safe to make one copyField destination stored, whilst atomically updating other fields, or vice versa? I have tried this out via the Solr Admin console, and have not been able to demonstrate any issues, but would like to be clear on what circumstances would trigger the problem.
It means that the copyField destination will have an additional value added from the source field, effectively creating a multi-valued field. If the destination isn't defined as multiValued, the document will no longer match the field's type, and no further updates can be made to it until you reindex everything. I'm currently struggling with this exact issue: we need the values to come back as part of the response for the copyField destination, which means it needs to be stored, but doing so breaks the structure of the document if we do an atomic update on a different field.
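A hedged sketch of how the loss happens (field names are illustrative, and this traces the behavior the documentation describes rather than a tested reproduction):

    <!-- destination is stored, against the documentation's advice -->
    <field name="title"    type="text_general" indexed="true" stored="true"/>
    <field name="all_text" type="text_general" indexed="true" stored="true"
           multiValued="true"/>
    <copyField source="title" dest="all_text"/>

    <!-- 1. Initial add: copyField puts a copy of title into all_text,
            and because all_text is stored, the copy is stored too. -->
    <!-- 2. Atomic update of any other field: Solr rebuilds the document
            from its stored values (all_text already contains the copy),
            then copyField fires again and appends a second copy.
            all_text grows on every update; if it were single-valued,
            the update would fail instead. Any value the indexing program
            sent directly to all_text is likewise mixed in with the copies
            and cannot be told apart - that is the information loss. -->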
In Solr, when I set a field as 'indexed' but not 'stored', it is still stored in the index. If I go the other way around and set the field as 'stored' but not 'indexed', it is also stored in the index, if I understand correctly.
My question is: how is the document stored internally in Lucene in these cases? What do 'stored' fields look like in Lucene, and what do 'indexed' fields look like internally?
The answer to this question will perhaps help me understand why atomic updates in Solr only work with stored fields and not indexed fields (as explained here: https://wiki.apache.org/solr/Atomic_Updates#Stored_Values).
In Solr/Lucene, indexed and stored are two different concepts.
indexed means the field value is saved in the inverted index, and you can search on it when you run a query. But you can't see it in the returned search result documents.
stored means the value is saved in the stored-fields part of the index, not in the inverted index; it cannot be searched on, but it can be displayed when you get the search result documents.
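A small sketch of the two cases (field names and types are mine, not from the question):

    <!-- searchable via the inverted index, but never returned in results -->
    <field name="body" type="text_general" indexed="true" stored="false"/>

    <!-- returned in results, but q=raw:foo will never match anything -->
    <field name="raw" type="string" indexed="false" stored="true"/>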
Actually, the way Solr performs an atomic update is to read the whole document back (from stored fields only), change the value you want to update, and save it back (re-indexing it). That's why atomic updates can only support stored fields.
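For example, a hedged sketch of an atomic update in Solr's XML update format (the field names are illustrative):

    <!-- set a new value for one field; Solr reconstructs every other
         field of document 1 from its stored values and re-indexes -->
    <add>
      <doc>
        <field name="id">1</field>
        <field name="price" update="set">99</field>
      </doc>
    </add>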
In my project, I have settled on creating relations between documents by storing them in a comma-separated field, e.g. relatedDocIds="2455,4564,7345". Those relations are updated from time to time by a scheduled job that runs through my DB, fetches a record, and updates its Solr document.
I know that instead of using a single comma-separated string field, I could use a multiValued string field, where each ID takes one value slot. Due to some limitations of my client API, though, I can only set one value per field at the moment. I have not seen any disadvantages to using it the way I do; queries such as relatedDocIds:2455 resolve exactly the way I want them to. The documentation for multiValued says that it does the same thing.
Am I missing a potential advantage of using multiValued? Is my method OK, and what are its limits? What would be a better and more optimized approach to store those IDs?
You are fine. Under the covers, the indexed form of the multiValued field is converted to a set of tokens, the same as if your tokenizer had split them using its own rules.
The main difference is that a multiValued field pretends that the end token of one value and the start token of the next value are far from each other. That's what positionIncrementGap means (usually 100).
This matters if you want to do a phrase search like "2455,4564". In your case, I believe, it would match, but if you had them as a multiValued field with each value separate, it would not.
And, of course, multiValued fields - if stored - are returned as an array of values. Single strings - if stored - are returned as they were given, even if the indexed version has been broken up into tokens.
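A hedged sketch of that phrase-search difference (the field type is an assumption - a simple comma-splitting tokenizer):

    <fieldType name="ids_csv" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
      </analyzer>
    </fieldType>

    <!-- one comma-separated value: tokens sit at adjacent positions,
         so the phrase query relatedDocIds:"2455 4564" matches -->
    <field name="relatedDocIds" type="ids_csv" indexed="true" stored="true"/>

    <!-- one id per value: the gap of 100 positions between values means
         the same phrase query does not match -->
    <field name="relatedDocIdsMV" type="ids_csv" indexed="true" stored="true"
           multiValued="true"/>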
On the documentation for Sunspot, it says:
If you make a change to the object's "schema" (code in the searchable block), you must reindex all objects so the changes are reflected in Solr
What happens if this procedure isn't followed?
Specifically, I have a fairly large index on Websolr, and if I just add a boolean field to it without reindexing, what will happen?
I'd like to be able to filter by true values of the boolean field, but I'll never need to filter by false or nil values. Will this work, or must this admonition to reindex always be obeyed?
In your case, if you add the field and do not reindex the data, it will still work.
However, the existing data will not have a value for the field.
Only the newly inserted documents will have values for it.
You can still filter the documents based on the field's values; the existing documents will simply have a nil value for the field.
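For example, a filter like the following (core and field names are illustrative) returns only documents indexed with an explicit true value; older documents simply lack the field and are excluded:

    http://localhost:8983/solr/mycore/select?q=*:*&fq=my_new_flag:true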
Usually it depends on what you change.
You would not need a reindex if you change the query-time analysis of a field type.
A simple restart or core reload would suffice.
Adding a field to the schema requires a reindex of the collection if you want all documents to have a value for the field.
If you change a field type, you would need to reindex the content, as the analysis done at indexing time for that field type would be different.
If you don't reindex, the query-time analysis performed for the field would differ from the indexed form, and no matches would be found.
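As an illustrative sketch of where the two cases differ (the synonym filter is my assumption): adding a filter only to the query analyzer needs just a core reload, while touching the index analyzer requires a full reindex:

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <!-- changing this block changes what is written to the index:
           a reindex is required so old and new documents analyze alike -->
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <!-- changing this block only affects how queries are parsed:
           a restart or core reload is enough -->
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
      </analyzer>
    </fieldType>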