lucene Fields vs. DocValues - solr

I'm experimenting with Lucene to index our data, and I've come across some strange behavior concerning DocValues fields.
Could anyone please explain the difference between a regular Document field (like StringField, TextField, IntField etc.) and DocValues fields
(like IntDocValuesField, SortedDocValuesField (the types seem to have changed in Lucene 5.0) etc.)?
First, why can't I access DocValues using document.get(fieldname)? If that isn't possible, how can I access them?
Second, I've seen that some features changed in Lucene 5.0; for example, sorting can only be done on DocValues. Why is that?
Third, DocValues can be updated but regular fields cannot (you have to delete and re-add the whole document)...
Also, and perhaps most important, when should I use DocValues and when regular fields?
Joseph

Most of these questions are quickly answered by either referring to the Solr Wiki or to a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern Search service except for the actual searching. From the Solr Community Wiki:
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
...
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.
The reason is that the storage layout is turned around relative to the standard inverted index. The inverted index maps each term to the documents that contain it, so to gather values for sorting or faceting the application previously had to un-invert that data into the FieldCache at search time. With DocValues the document-to-value mapping is written at index time, so the value for any document can be looked up directly. That is very useful when you already have a list of matching documents and need their values.
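To make the two layouts concrete, here is a toy sketch in plain Python. It is not Lucene code and does not reflect Lucene's on-disk format; the documents, the "price" values and the helper name sort_hits are all made up for illustration. It only shows why a per-document column makes sorting a set of hits cheap.

```python
from collections import defaultdict

# Three toy documents, keyed by doc id, each with one "price" value (made up).
docs = {0: 30, 1: 10, 2: 20}

# Inverted layout: value -> doc ids. Good for "which documents contain X?".
inverted = defaultdict(list)
for doc_id, price in docs.items():
    inverted[price].append(doc_id)

# DocValues-style column: position = doc id, entry = that document's value.
column = [docs[doc_id] for doc_id in sorted(docs)]


def sort_hits(hits, column):
    """Sort matching doc ids by their value, using one column lookup per hit."""
    return sorted(hits, key=lambda doc_id: column[doc_id])


hits = [0, 2]  # doc ids matched by some query
sorted_hits = sort_hits(hits, column)
print(sorted_hits)
```

With only the inverted layout, answering "what is the value of doc 2?" would mean scanning the value lists until doc 2 turns up, which is the work the FieldCache used to do up front.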
If I remember correctly, updating a DocValues-based field involves yanking the document out of the previous token list and re-inserting it at the new location, whereas with the previous approach an update would change loads of dependencies (so reindexing was the only viable strategy).
Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.

Related

Understanding solr field cache

While I was working on this solr LowerCaseFilterFactory not working, I got the following error:
...enable docvalues true n reindex or place useFieldCache=true...
It was resolved by specifying useFieldCache=true in the query.
Is it the lucene FieldCache?
Can anybody help me know more about this?
When you're using docValues, the field cache isn't used. Since docValues aren't implemented for TextFields yet, the filtering hasn't been applied the way you think it would be, so the values used for sorting aren't lowercased as you'd assume.
When you tell Solr to explicitly use the FieldCache, you're saying "don't use the docValues, even if they're available - use the old FieldCache implementation instead".
The correct solution would be to disable docValues for the Text field.
In Lucene-Solr 4.5 and later, docValues are mostly disk-based to avoid the requirement for large heap allocations in Solr. If you use the field cache in sort, stats, and other queries, make those fields docValues.
Please check this

Solr equivalent to ElasticSearch Mapping Type

ElasticSearch has Mapping Types. According to the docs:
Mapping types are a way to divide the documents in an index into
logical groups. Think of it as tables in a database.
Is there an equivalent in Solr for this?
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't. So, is there an equivalent?
Or, maybe a better question,
If I have multiple document types and I want to limit searches to a certain document type, which one offers the better solution?
I hope this question makes sense, since I'm new to both of them.
Thanks!
You can configure multicore solr:
http://wiki.apache.org/solr/CoreAdmin
Maybe something has changed since Solr 4.0 and it's easier now; I haven't looked at it since I switched to Elasticsearch. Personally I find Elasticsearch's index/type system much better than that.
In Solr 4+.
If you are planning to do faceting or any other calculations across multiple types, then create a single schema with a differentiator field. Then, in your business/mapping/client layer, define only the fields you actually want to look at. Use custom search handlers with the 'fl' parameter to return only the fields relevant to that object. Of course, that means that none of those single-type-only fields can be compulsory.
If your document types are completely disjoint, you can create a core/collection per type, each with its own definition file. You have full separation, but still have only one Solr server to maintain.
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't.
You can exactly do this in Solr. Add a field and use it to filter.
It is correct that Mapping Types in ElasticSearch do not have to share the same schema, but under the hood ElasticSearch uses only ONE schema for all Mapping Types. So technically it makes no difference. In fact, the Mapping Type is mapped to an internal schema field.
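As a rough illustration of the differentiator-field approach, the sketch below builds the request parameters for a type-restricted search. The parameters q, fq and fl are standard Solr query parameters; the field name doc_type, the field list and the helper typed_query_params are assumptions made up for this example.

```python
from urllib.parse import urlencode


def typed_query_params(user_query, doc_type, fields):
    """Build Solr select parameters restricted to one document type."""
    return {
        "q": user_query,
        "fq": f"doc_type:{doc_type}",  # doc_type is a hypothetical differentiator field
        "fl": ",".join(fields),        # return only the fields relevant to this type
    }


params = typed_query_params("laptop", "product", ["id", "name", "price"])
query_string = urlencode(params)
print(query_string)
```

Putting the type restriction in fq rather than in q keeps it out of relevance scoring and lets Solr cache the filter separately.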

Geoclusters in SOLR

We're reimplementing a search that includes locations that need to be clustered on a map. I've been searching without luck for an implementation in SOLR.
The current search with map clustering implemented is at http://www.uship.com/find
Has anyone seen similar or have ideas about how to best do this?
Regards,
Nick
If the requirement is to cluster a fairly small number of points, perhaps less than 1000, then Solr needn't be involved. Grab the points and plot them using something like HeatmapJS.
I presume the requirement is to cluster all results in a search, which may potentially be many thousands or even millions of documents. I suggest starting with generating a heatmap of the densities over a grid of the search area. You can do this by indexing each point encoded in geohash form at each length (e.g. D2RY, D2R, D2, D), with each value preceded by its length: 4_D2RY, 3_D2R, 2_D2, 1_D. These little strings go into a multi-valued "string" type field in Solr that you will then facet on. When faceting, you'll come up with a suitable grid resolution (i.e. geohash prefix length) and then use that as a prefix query (e.g. facet.prefix=4_). You can index the point using a LatLonType field separately and do a standard bounding box query there. At this point, your faceted search results will give you the information to fill in a grid of numbers. The beauty of this scheme is that it is fast -- you could generate such heat-maps on the fly. It will use a fair amount of RAM though, since this is faceting on a multi-valued field that will have a ton of values. This is something I want to add to the new Lucene spatial module (or perhaps at the Solr layer) in a way that won't need extra memory and to make it easy. It won't make it to Solr 4.0, but maybe 4.1.
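The token scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the geohashes are made up, the helper names prefix_tokens and grid_counts are hypothetical, and grid_counts merely emulates in memory what faceting with facet.prefix would return from Solr.

```python
from collections import Counter


def prefix_tokens(geohash):
    """Return length-prefixed tokens for every prefix of a geohash, longest first."""
    return [f"{n}_{geohash[:n]}" for n in range(len(geohash), 0, -1)]


def grid_counts(all_tokens, length):
    """Emulate faceting with facet.prefix: count the tokens of one prefix length."""
    prefix = f"{length}_"
    return Counter(t[len(prefix):] for t in all_tokens if t.startswith(prefix))


points = ["D2RY", "D2RZ", "D2QX"]  # made-up geohashes for three documents
tokens = [t for gh in points for t in prefix_tokens(gh)]
cells = grid_counts(tokens, 3)     # grid resolution = geohash prefix length 3
print(prefix_tokens("D2RY"))
print(cells)
```

Each key in cells is one grid cell and its count is the density for the heatmap; choosing a shorter prefix length gives a coarser grid.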
At this stage, perhaps a heatmap is fine as-is. But you may want to apply clustering on top of this, as your question states. Someone tipped me off to some interesting geo clustering algorithms that can be applied to heatmaps.
I don't know whether you searched lucidworks, but there are many interesting resources there:
Search with Polygons: Another Approach to Solr Geospatial Search
Go through these:
http://www.lucidimagination.com/search/?q=geospatial#%2Fn
Already implemented in Solr:
http://wiki.apache.org/solr/SpatialSearch/ (what's wrong with this approach?)
http://wiki.apache.org/solr/SpatialSearchDev
https://issues.apache.org/jira/browse/SOLR-3304

Can Solr/Lucene do Fuzzy Field Collapsing?

Edit
Can Solr do fuzzy field collapsing? IE collapsing fields that have similar values, rather than identical ones?
I'd assumed that it could, but now I'm not sure, which makes my original question below invalid.
Original Question
For a large given set of values I need to decide which is the most prevalent. The set of all values will change over time, and so I can expect that the output may change over time too.
I gather Solr can do "field collapsing" to group results by a given field, with a tolerance of similarity. Would it be possible, nay, even appropriate, to use Solr solely to collapse fields, to derive the most common value? We use Solr in other parts of the business, and it would be good to leverage existing code rather than home-brewing a custom solution.
No, Solr does not support fuzzy collapsing (at least not based on what is documented on the wiki).
Solr 4.0 supports group.func which allows you to group results based on the result of a FunctionQuery, so it's possible that at some point in time a function could be created to get you approximately what you want, but none of the existing functions will do what you want.
However, Solr does support result clustering, which will maybe work for your use-case. Clustering is done with Carrot2. If you limit the fields used by carrot to a single field, you may get a similar result to "fuzzy clustering", but you have far less control over what carrot does than you do with field collapsing.
For a normal document you might want all your fields analyzed by carrot, e.g.:
carrot.title=my_title&carrot.snippet=my_title,my_description
But if you have, for example, a manufacturer field with slight variations of spelling or punctuation, it might work to only give carrot a single field for both title and snippet:
carrot.title=manufacturer&carrot.snippet=manufacturer

Sorting by recent access in Lucene / Solr

In my Solr queries, I want to sort the most recently accessed documents to the top ("accessed" meaning opened by user action). No other search criterion carries any weight for me: of the documents with text matching the query, I want them in order of recent use. I can only think of two ways to do this:
1) Include a 'last accessed' date field in each doc to have Solr sort upon. Trie Date fields can be sorted very quickly, I'm told. The problem of course is keeping the field up to date, which would require storing each document's text so I can delete and re-add any document with an updated 'last accessed' field. Mutable fields would obviate this, but Lucene/Solr still doesn't offer mutable fields.
2) Alternatively, store the mutable 'last accessed' dates and keep them updated in another db. This would require Solr to return the full list of matching documents, which could be upwards of hundreds of thousands of documents. This huge list of document ids would then be matched up against dates in the db and then sorted. It would work OK for uncommon search terms, but not for broad, common search terms.
So the trade off is between 1) index size plus a processing cost every time a document is accessed and 2) big query overhead, especially for unfocused search terms
Do I have any alternatives?
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles
http://blog.mikemccandless.com/2012/01/tochildblockjoinquery-in-lucene.html
You should be able to do this with the atomic update functionality.
http://wiki.apache.org/solr/Atomic_Updates
This functionality is available as of Solr 4.0. It allows you to update a single field in a document without having to reindex the entire document. I only know about this functionality from the documentation. I have not used it myself, so I can't say how well it works or if there are any pitfalls.
Definitely use option 1, using SOLR queries and updating the lastAccessed field as needed.
Since SOLR 4.0, partial document updates are supported in several flavours: https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
For your application it seems that a simple atomic update would be sufficient.
With respect to performance, this should work very well for large collections and fast document updates.
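For what it's worth, an atomic update of this kind is just a small JSON body sent to the update handler, using the "set" modifier on the one field to change. The sketch below only builds that body and does not send it; the unique key name id, the document id doc-42 and the helper last_accessed_update are assumptions for illustration, and your schema may use different names.

```python
import json
from datetime import datetime, timezone


def last_accessed_update(doc_id, when):
    """Build an atomic-update body that sets only the lastAccessed field."""
    # Solr date fields expect UTC timestamps in this ISO-8601 form.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [{"id": doc_id, "lastAccessed": {"set": stamp}}]


body = last_accessed_update(
    "doc-42", datetime(2013, 5, 1, 12, 0, tzinfo=timezone.utc)
)
payload = json.dumps(body)
print(payload)
```

Note that, as far as I know, Solr applies an atomic update by rebuilding the document from its stored fields, so the other fields need to be stored for this to work.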
