I have read many articles here and concluded that the index is for searching, while docValues are for sorting and faceting. I am confused about whether the index and docValues are the same data structure, or the same idea (storing column values to get doc IDs). If they are not the same, where is the difference?
Inverted index ::
An inverted index is the concept on which the search library Lucene is built.
The standard way that Solr builds the index is with an inverted index.
This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.
This is like retrieving pages in a book related to a keyword by scanning the index at the back of a book,
as opposed to searching every word of every page of the book.
This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
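As a rough illustration of the term-to-document idea described above, here is a minimal Python sketch. It is not how Lucene stores its index on disk; the function name `build_inverted_index`, the whitespace tokenizer, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: number of times the term appears in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenizer (lowercase + whitespace split), assumed for illustration.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

# Hypothetical documents keyed by doc ID.
docs = {
    0: "solr builds an inverted index",
    1: "lucene builds the index",
    2: "an index of terms",
}

inverted = build_inverted_index(docs)

# A search by term is a single dictionary lookup, not a scan of every document.
print(sorted(inverted["index"]))
print(inverted["builds"])
```

The lookup goes from a word to the documents that contain it, which is the "word->pages" direction of the book analogy.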
Solr stores this index in a directory called index in the data directory.
DocValue ::
For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).
In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
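To make the document-to-value mapping concrete, here is a small Python sketch of a column that is indexed by doc ID. It only models the idea; the names `category_column`, `facet_counts`, and `sort_by_field` are hypothetical and the real on-disk format is more involved.

```python
from collections import Counter

# Hypothetical column for a "category" field: list position = doc ID, entry = value.
category_column = ["books", "music", "books", "film", "music", "books"]

def facet_counts(column, matching_doc_ids):
    """Count field values over a result set with one direct lookup per document."""
    return Counter(column[doc_id] for doc_id in matching_doc_ids)

def sort_by_field(column, matching_doc_ids):
    """Order the result set by the field value of each document."""
    return sorted(matching_doc_ids, key=lambda doc_id: column[doc_id])

# Doc IDs that matched some query (assumed for the example).
result_set = [4, 3, 0, 2]

counts = facet_counts(category_column, result_set)
ordered = sort_by_field(category_column, result_set)
print(counts)
print(ordered)
```

Both operations start from doc IDs and ask "what is this document's value?", which is the opposite direction from the term-to-document lookup used for searching.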
You only need to enable docValues for the fields that will use them.
As with all schema design, you need to define a field type and then define fields of that type with docValues enabled. Enabling a field for docValues only requires adding docValues="true" to the field. DocValues are only available for specific field types.
<field name="category" type="string" indexed="false" stored="false" docValues="true" />
Please help me understand the following regarding Solr:
1) Where are stored fields and docValues fields saved in Solr?
2) If we enable docValues for some fields, will normal query performance (search only, with no faceting or sorting applied) be better than with stored fields?
3) Is it advisable to replace all the stored fields with docValues?
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
Stored fields store all field values for one document together in a row-stride fashion. When a document is retrieved, all of its field values are returned at once, so loading the relevant information about a document is very fast.
However, if you need to scan a field (for faceting/sorting/grouping/highlighting) it will be a slow process, as you will have to iterate through all the documents and load each document's fields per iteration resulting in disk seeks.
Field values retrieved during search queries are typically returned from stored values. However, non-stored docValues fields will also be returned along with other stored fields when all fields (or pattern matching globs) are specified to be returned (e.g. “fl=*”) for search queries, depending on the effective value of the useDocValuesAsStored parameter for each field. For schema versions >= 1.6, the implicit default is useDocValuesAsStored="true".
When retrieving fields from their docValues form (using the /export handler, streaming expressions or if the field is requested in the fl parameter), two important differences between regular stored fields and docValues fields must be understood:
Order is not preserved. For simply retrieving stored fields, the insertion order is the return order. For docValues, it is the sorted order.
Multiple identical entries are collapsed into a single value. Thus if I insert values 4, 5, 2, 4, 1, my return will be 1, 2, 4, 5.
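The two differences above can be mimicked in a few lines of Python. This is only a model of the observable behaviour for a multi-valued field, under the assumption that the docValues form behaves like a sorted set; the function names are hypothetical.

```python
def as_returned_from_stored(values):
    """Stored fields: insertion order is kept and duplicates survive."""
    return list(values)

def as_returned_from_docvalues(values):
    """docValues (modelled as a sorted set): values come back sorted and de-duplicated."""
    return sorted(set(values))

inserted = [4, 5, 2, 4, 1]
print(as_returned_from_stored(inserted))
print(as_returned_from_docvalues(inserted))
```

If the application relies on the original order or on repeated values, the field should be read from its stored form instead.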
In cases where the query is returning only docValues fields performance may improve since returning stored fields requires disk reads and decompression whereas returning docValues fields in the fl list only requires memory access.
In a low-memory environment, or when you don’t need to index a field, docValues are perfect for faceting/grouping/filtering/sorting/function queries.
For more details, please refer to DocValues.
I'm trying to use Solr's MoreLikeThis feature to find similar documents based on some other document, but I don't quite understand how some of this functionality works.
As it says here, the MoreLikeThis component works best when the termVectors are stored. And here comes my confusion.
Is it enough that I enable the termVectors flag on a field (let's say the field contains a movie review text) in Solr's schema.xml file? Will it make Solr calculate the termVectors for a given field after inserting it, store them, and then use the calculated termVectors in subsequent calls to the MoreLikeThis handler?
Short answer is NO, you need to re-index after such a schema change.
Having term vectors enabled will speed up the process of finding the interesting terms from the original input document (if this document is in the index).
The timing of the second phase (when the More Like This query runs) will remain the same.
For more information about how MLT works, see [1].
In general, when applying such changes to the schema, you need to re-index your documents so that Solr builds the related data structures (the term vector is a mini index per document and requires specific files to be stored on disk [2]; N.B. this will increase your disk utilisation).
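As a loose sketch of why a per-document "mini index" helps the first phase, the Python below precomputes a term-frequency map per document and then picks interesting terms from it without re-analysing the text. The tf-idf style weighting, the whitespace tokenizer, the sample reviews, and the function names are all assumptions for illustration; this is not Lucene's actual MLT scoring.

```python
import math
from collections import Counter

def term_vector(text):
    """Per-document mini index: term -> frequency within this one document."""
    return Counter(text.lower().split())

# Hypothetical movie reviews keyed by doc ID.
reviews = {
    0: "a slow but moving war drama",
    1: "a funny and moving family drama",
    2: "a loud and funny action movie",
}

# Built once at index time, which is why existing documents must be re-indexed
# after the schema change before they have a term vector at all.
term_vectors = {doc_id: term_vector(text) for doc_id, text in reviews.items()}

def interesting_terms(doc_id, top_n=3):
    """Rank a document's terms by tf * log(N / df), an assumed weighting."""
    n_docs = len(term_vectors)
    doc_freq = Counter(t for tv in term_vectors.values() for t in tv)
    scores = {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in term_vectors[doc_id].items()
    }
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

print(interesting_terms(0))
```

Without the stored vectors, the same term frequencies would have to be recomputed from the document's text each time, which is the extra cost the answer refers to.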
[1] https://www.slideshare.net/AlessandroBenedetti/advanced-document-similarity-with-apache-lucene
[2] https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene50/Lucene50TermVectorsFormat.html
SOLR (Lucene) indices, like all inverted indices, use a term dictionary to assign an index to each term. Each field in the index generates its own term dictionary (which can be inspected in the SOLR admin tool).
I have a very large SOLR index, where each document has very many textual fields. All fields contain English text following a similar distribution.
In my case this is very wasteful: it maintains many very large term dictionaries (in memory) which are almost all the same... as the number of (different) terms in the documents grows these dictionaries grow very large.
I cannot combine all fields into a single search field because I need to run queries restricted over specific fields.
Is there a way to tell SOLR to use the same term dictionary for several fields?
(Afterthought: but perhaps if terms follow a Zipfian distribution, the amount of sharing between fields won't be significant anyway, as many terms will appear only once and hence only in one dictionary?)
I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.
So, could anyone please explain the difference between a regular Document field (like StringField, TextField, IntField etc.) and DocValues fields
(like IntDocValuesField, SortedDocValuesField (the types seem to have changed in Lucene 5.0) etc.)?
First, why can't I access DocValues using document.get(fieldname)? If I can't, how can I access them?
Second, I've seen that in Lucene 5.0 some features are changed, for example sorting can only be done on DocValues... why is that?
Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...
Also, and perhaps most important, when should I use DocValues and when regular fields?
Joseph
Most of these questions are quickly answered by either referring to the Solr Wiki or to a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern Search service except for the actual searching. From the Solr Community Wiki:
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
...
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.
The reason for this is that the storage format is turned around relative to the standard inverted index. Where the application previously had to un-invert the term-to-document index to find the value of each document, it can now look up a document's value directly from the column. This is very useful when you already have a list of matching documents that you need to sort or facet.
If I remember correctly, updating a DocValue-based field involves yanking the document out from the previous token list, and then re-inserting it into the new location, compared to the previous approach where it would change loads of dependencies (and reindexing was the only viable strategy).
Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.
I am using Solr to provide faceted search on articles. Generally, from what I read, the faceted fields need to be indexed together with the rest of the article. However, in my case the faceted fields are derived ontology terms, not in the original content, and I might add additional ontologies as we go.
One solution is to add these new terms to the original articles as new fields and index the whole thing. However, the ontology terms can change and grow frequently, and reindexing the entire collection could take a very long time (over a month), so I am wondering if I can just index these ontology terms as a separate core, without re-indexing the original articles. Is it feasible to query across multiple cores with hierarchical faceted search?
Thanks!