Understanding solr field cache - solr

While I was working on this solr LowerCaseFilterFactory not working, I got the following error:
...enable docvalues true n reindex or place useFieldCache=true...
It was resolved by specifying useFieldCache=true in the query.
Is it the lucene FieldCache?
Can anybody help me know more about this?

When you're using docValues, the field cache isn't used. Since docValues isn't implemented for TextFields yet, the filtering hasn't been applied like you think it would, so the values used for sorting isn't lowercased as you'd assume they'd be.
When you tell Solr to explicitly use the FieldCache, you're saying "don't use the docValues, even if they're available - use the old FieldCache implementation instead".
The correct solution would be to disable docValues for the Text field.

In Lucene-Solr 4.5 and later, docValues are mostly disk-based to avoid the requirement for large heap allocations in Solr. If you use the field cache in sort, stats, and other queries, make those fields docValues
Please check this

Related

Stemming parameter in Solr

Is there any parameter like (edismax or dismax or any other) that i can set for stemming to work in Solr or i need to make changes in schema.xml of Solr to implement the stemming ?
Problem is if i change schema.xml by default stemming/phoentic work which i dont want ? I am using Solr from third party application and in UI we have checkbox for stemming to check/uncheck , i pass these paramaters to Solr and get the data from Solr, i cant pass this UI parameter to SOlr, so if there is any parameter at Solr side i can pass that for stemming to work ?
Please let me know ?
Stemming is performed as part of the analysis chain, and therefor is part of how the schema for that particular field is defined.
The reason for this becomes apparent when you consider how stemming works - for stemming to make sense, the term has to be stemmed when it's being indexed, as well as when being queried.
Lucene takes your input string, runs it through your analysis chain and saves the generated tokens to its index. Giving it what are you asking will probably end up as what, are, you, ask after tokenizing by whitespace and applying stemming.
The same operation happens when querying, so if someone searches for asks, the token gets stemmed to ask - and then compared against what's in the index. If stemming hadn't taken place when indexing, you'd end up with asking in the index, and ask when querying - and that isn't a match, since the tokens aren't the same.
In your third party application the stemming option probably performs stemming inside the application before sending the content to Solr.
You can also use the Schema API to dynamically update and change field type definitions.

How to disable caching for sort query in solr?

I am sorting by 'id' field which has 25 million unique values. Hence after sorting I could see the fieldcache populated for 'id' field with more than 1 GB. Hence I want to disable making entry in fieldcache because I can optimize for Heapmemory with performance.I tried giving q={!cache=false}*.* but could still see the fieldcache is populated with id field. I also gave docValues=true,indexed=false and stored=false but nothing could prevent 'id' field from populating in fieldcache. I am using solr 5.2.1 in cloud mode. Is there any way to achieve this?
After changing the field definition you'll have to reindex all the documents. Since the change to using docValues won't get populated for all the documents otherwise, you'll still end up with the old structure taking up space in your field cache.
If you're just reindexing a single document and then committing that one, that document is the only one where docValues has been populated. Optimizing the index will merge the segments that make up the index, but not rewrite their content.
I'm not familiar with a way to disable the field cache. Since it's a low level Lucene concept, Solr can't really do much about it.

How to disable Solr query analysis?

I'm working with solr5.2 and I'm using termVectors with solrj (but an answer not using solrj would be nice as well).
From a first query, I obtain termVectors, and I'd like to query again my index with some of the terms from these termVectors.
However the terms from termVectors are obviously already stemmed, and I'd like to go directly to the corresponding entry in the index, without going through the query analysis step (otherwise, my stem will be stemmed again, which can lead to a different entry).
A workaround would be to stem all terms at indexing time, and to index them in a separate String field, but I'd like to avoid this ugly solution.
Is there a better way?
You can define separate analysis chains for query and indexing (I read your caveat as having to do it outside of Solr, as you're talking about String fields):
<analyzer type="index">
So you could have one field that does not perform stemming on query, just on indexing. That might not be suitable for your primary field, so add a second one and use copyField to index into that field as well.

Solr queries stored within Solr field

I have a set of keywords defined by client requirements stored in a SOLR field. I also have a never ending stream of sentences entering the system.
By using the sentence as the query against the keywords I am able to find those sentences that match the keywords. This is working well and I am pleased. What I have essentially done is reverse the way in which SOLR is normally used by storing the query in Solr and passing the text in as the query.
Now I would like to be able to extend the idea of having just a keyword in a field to having a more fully formed SOLR query in a field. Doing so would allow proximity searching etc. But, of course, this is where life becomes awkward. Placing SOLR query operators into a field will not work as they need to be escaped.
Does anyone know if it might be possible to use the SOLR "query" function or perhaps write a java class that would enable such functionality? Or is the idea blowing just a bit too much against the SOLR winds?
Thanks in advance.
ES has percolate for this - for Solr you'll usually index the document as a single document in a memory based core / index and then run the queries against that (which is what ES at least used to do internally, IIRC).
I would check out the percolate api with ElasticSearch. It would sure be easier using this api than having to write your own in Solr.

lucene Fields vs. DocValues

I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.
So, Could anyone please just explain the difference between a regular Document field (like StringField, TextField, IntField etc.) and DocValues fields
(like IntDocValuesField, SortedDocValuesField (the types seem to have change in Lucene 5.0) etc.) ?
First, why can't I access DocValues using document.get(fieldname)? if so, how can I access them?
Second, I've seen that in Lucene 5.0 some features are changed, for example sorting can only be done on DocValues... why is that?
Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...
Also, and perhaps most important, when should I use DocValues and when regular fields?
Joseph
Most of these questions are quickly answered by either referring to the Solr Wiki or to a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern Search service except for the actual searching. From the Solr Community Wiki:
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
...
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.
The reason for this is that the storage format is turned around from the standard format when gathering data for these operations, where the application previously have to go through each document to find the values, it can now look up the values and find the corresponding documents instead. Which is very useful when you already have a list of documents that you need to perform an intersection on.
If I remember correctly, updating a DocValue-based field involves yanking the document out from the previous token list, and then re-inserting it into the new location, compared to the previous approach where it would change loads of dependencies (and reindexing was the only viable strategy).
Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.

Resources