We're reimplementing a search that includes locations that need to be clustered on a map. I've been searching without luck for an implementation in SOLR.
The current search with map clustering implemented is at http://www.uship.com/find
Has anyone seen similar or have ideas about how to best do this?
Regards,
Nick
If the requirement is to cluster a fairly small number of points, perhaps less than 1000, then Solr needn't be involved. Grab the points and plot them using something like HeatmapJS.
I presume the requirement is to cluster all results in a search, which may potentially be many thousands or even millions of documents. I suggest starting by generating a heatmap of the densities over a grid of the search area.

You can do this by indexing each point encoded in geohash form at each length (e.g. D2RY, D2R, D2, D), prefixing each value with its length: 4_D2RY, 3_D2R, 2_D2, 1_D. These little strings go into a multi-valued "string" type field in Solr that you will then facet on. When faceting, you'll pick a suitable grid resolution (i.e. geohash prefix length) and use it as a prefix query, like facet.prefix=4_. You can index the point in a separate LatLonType field and do a standard bounding box query there. At this point, your faceted search results will give you the information to fill in a grid of numbers.

The beauty of this scheme is that it is fast -- you could generate such heatmaps on the fly. It will use a fair amount of RAM though, since this is faceting on a multi-valued field that will have a ton of values. This is something I want to add to the new Lucene spatial module (or perhaps at the Solr layer) in a way that won't need extra memory and will be easier to use. It won't make it into Solr 4.0, but maybe 4.1.
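A minimal sketch of both halves of this scheme, assuming the geohash strings are already computed (e.g. with spatial4j's GeohashUtils) and using made-up field names (geohash_grid, point):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;

public class GeohashGridSketch {

    // Expand one geohash into length-prefixed values for the multi-valued
    // "string" field, e.g. "D2RY" -> ["1_D", "2_D2", "3_D2R", "4_D2RY"]
    static List<String> geohashPrefixes(String geohash) {
        List<String> values = new ArrayList<String>();
        for (int len = 1; len <= geohash.length(); len++) {
            values.add(len + "_" + geohash.substring(0, len));
        }
        return values;
    }

    public static void main(String[] args) {
        // Query side: standard bounding box on the LatLonType field,
        // faceting on the prefix field at the chosen grid resolution
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("point:[43.0,-80.0 TO 44.0,-79.0]"); // LatLonType range query = bounding box
        q.setFacet(true);
        q.addFacetField("geohash_grid");
        q.setFacetPrefix("4_"); // grid resolution: geohash prefix length 4
        q.setFacetLimit(-1);    // return a count for every cell in the box
        q.setRows(0);           // only the facet counts are needed
    }
}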
At this stage, perhaps a heatmap is fine as-is. But you may want to apply clustering on top of this, as your question states. Someone tipped me off to some interesting geo clustering algorithms that can be applied to heatmaps.
I don't know whether you searched Lucidworks, but there are many interesting resources there:
Search with Polygons: Another Approach to Solr Geospatial Search
Go through these:
http://www.lucidimagination.com/search/?q=geospatial#%2Fn
Already implemented in Solr:
http://wiki.apache.org/solr/SpatialSearch/ (what's wrong with this approach?)
http://wiki.apache.org/solr/SpatialSearchDev
https://issues.apache.org/jira/browse/SOLR-3304
Related
I'm using and playing with Lucene to index our data, and I've come across some strange behaviors concerning DocValues fields.
So, could anyone please explain the difference between a regular document field (like StringField, TextField, IntField, etc.) and DocValues fields
(like IntDocValuesField, SortedDocValuesField, etc.; the types seem to have changed in Lucene 5.0)?
First, why can't I access DocValues using document.get(fieldname)? And if not that way, how can I access them?
Second, I've seen that in Lucene 5.0 some features have changed; for example, sorting can only be done on DocValues fields. Why is that?
Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...
Also, and perhaps most important, when should I use DocValues and when regular fields?
Joseph
Most of these questions are quickly answered by referring to the Solr wiki or a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern search service except the actual searching. From the Solr Community Wiki:
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
...
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.
The reason for this is that the storage format is turned around from the standard format when gathering data for these operations: where the application previously had to go through each document to find the values, it can now look up the values and find the corresponding documents instead. This is very useful when you already have a list of documents that you need to intersect with.
If I remember correctly, updating a DocValue-based field involves yanking the document out from the previous token list, and then re-inserting it into the new location, compared to the previous approach where it would change loads of dependencies (and reindexing was the only viable strategy).
Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.
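To make the contrast concrete, here is a minimal Lucene 5 sketch (field names are made up): the StringField is what document.get() sees, while the SortedDocValuesField is invisible to document.get() but is what the sort and the per-segment lookup use.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;

public class DocValuesSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        // Regular field: indexed for search and (with Field.Store.YES) retrievable via document.get()
        doc.add(new StringField("title", "lucene in action", Field.Store.YES));
        // DocValues field: column-oriented doc-to-value storage, used for
        // sorting/faceting, but NOT returned by document.get()
        doc.add(new SortedDocValuesField("title_sort", new BytesRef("lucene in action")));
        writer.addDocument(doc);
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Lucene 5 only allows sorting on a DocValues field
        Sort sort = new Sort(new SortField("title_sort", SortField.Type.STRING));
        searcher.search(new MatchAllDocsQuery(), 10, sort);

        // Reading DocValues: a per-segment lookup instead of document.get()
        for (LeafReaderContext leaf : reader.leaves()) {
            SortedDocValues dv = leaf.reader().getSortedDocValues("title_sort");
            BytesRef value = dv.get(0); // value for the first doc in this segment
            System.out.println(value.utf8ToString());
        }
        reader.close();
    }
}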
I am searching for a retrieval server right now for my image retrieval project. From what I see on the Internet, Lucene and Solr are specialized for textual searching, but do you think it is possible and reasonable to adapt them for image retrieval?
You might suggest an image-specific tool like LIRE, but it has predefined feature extraction algorithms and is not very flexible for new features. Basically, all I need is to index the image features coming out of my extraction pipeline (written in Python) with a server like Lucene or Solr and perform retrieval tasks based on Euclidean distance over the indexed features.
Any suggestion or pointer to any reference would be very useful. Thanks.
Based on your post, you could store the features as keyword fields in Lucene or ES (Solr has a strict schema definition, and I don't think it would fit your needs very well, as the feature matrix is usually sparsely populated in my understanding) and have a unique ID field derived from the image hash. Then you can just search for feature values (feature1:value1 AND feature2:value2) and see what matches the query.
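A minimal sketch of that idea against raw Lucene 5 (the field names and quantized feature values here are made up):

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class FeatureIndexSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()));

        Document doc = new Document();
        doc.add(new StringField("id", "imagehash-42", Field.Store.YES)); // unique ID from the image hash
        doc.add(new StringField("feature1", "value1", Field.Store.NO));  // one keyword field per feature
        doc.add(new StringField("feature2", "value2", Field.Store.NO));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("feature1", "value1")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("feature2", "value2")), BooleanClause.Occur.MUST);
        TopDocs hits = searcher.search(query, 10);
        System.out.println(hits.totalHits + " matching image(s)");
    }
}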
If you're going to work with Euclidean distances, you'll want to look into using the Spatial Features of Solr. This will allow you to index your values as coordinates, then perform indexed lookups from other points and sort by their Euclidean distances.
You might also want to look at the dist and sqedist functions.
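For example, a sketch with SolrJ, sorting by squared Euclidean distance (the feat*_d field names and the query vector are assumptions; sqedist preserves distance ordering while skipping the square root):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
SolrQuery q = new SolrQuery("*:*");
// Distance between the indexed vector (feat1_d, feat2_d, feat3_d)
// and the query vector (0.12, 0.55, 0.30), nearest first
q.set("sort", "sqedist(feat1_d,feat2_d,feat3_d,0.12,0.55,0.30) asc");
q.setRows(10);
QueryResponse rsp = solr.query(q);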
We have an application for tagging user selections over a large corpus of MS Word documents. We tag these selections with one or more keyword tags, and usually a title tag. We want to add a feature where the selected text is instantly analyzed, and the tagger is presented with a list of most-likely keyword and title tags (based on the existing tagged text selections).
We are using a SOLR index. I have been told that we can simply issue the selected text as the query itself to return similar selections. However, the selected text could be anywhere between 200 and 6000 words long. A 6000 word query may be a problem in terms of memory usage!
I thought we could do some very aggressive stopword removal to significantly reduce the number of words in the queries, leaving only the very meaningful words. We have been working with this corpus for the last 10 years and we are very familiar with the subject matter and the vocabulary used, so this would be easy for us to do. But the problem is that we also use the same index for allowing the normal users to search the index, and if we remove too many common words, then their normal queries may not work properly (especially phrase queries).
We would also like to boost the results that contain the text of the query within a smaller range, rather than just spread arbitrarily throughout the document.
Another issue is that we allow nested selections. The outer selection may be more general in nature and be around 5000 words long, and the inner selections will be shorter and topically more specific. However, since both selections contain the same text, SOLR ranks them both highly, when the outer selection may not be so relevant.
I have spent the last few days going through the SOLR query parser documentation, and it looks like this should be doable, but I'm still not sure exactly what I need to do to make this work. Any suggestions would be much appreciated.
Solr has a multi-core facility, so if you keep one core for your internal work and expose the other core publicly, that may solve your issue.
You can refer to this section:
http://wiki.apache.org/solr/Solr.xml%20(supported%20through%204.x)
or to the "Solr Cores and solr.xml" section in the Solr Reference Guide.
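For illustration, a minimal solr.xml in the legacy (through 4.x) format referenced above, declaring two cores; the core names are just placeholders:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- core exposed to end users for normal searching -->
    <core name="public" instanceDir="public"/>
    <!-- core used internally, e.g. with a more aggressive analysis chain -->
    <core name="internal" instanceDir="internal"/>
  </cores>
</solr>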
I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a query filter of, say, "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there's a way to run a single query while applying the first filter to get the first set of topDocs, then generate a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting resorts and mixes the two sets).
I experimented with partitioning the data into two different groups by marking a specific field during indexing, and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the two-query solution's performance, or investigating some kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternative approaches:
Instead of filtering the results, use a boost so that all the results for type:digital come out on top and the rest of the documents follow. No need for separate queries. The boost can be changed as per the type value; see the sketch below.
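A sketch of this with SolrJ (the qf fields and boost value are assumptions; also note the SolrJ client class name varies by version, e.g. CommonsHttpSolrServer in 3.5, HttpSolrServer from 3.6 on):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
SolrQuery q = new SolrQuery("camera");
q.set("defType", "dismax");      // dismax supports the bq boost-query parameter
q.set("qf", "name description"); // assumed searchable fields
q.set("bq", "type:digital^10");  // digital results float to the top, others follow
QueryResponse rsp = solr.query(q);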
The other approach is to not display results for types other than digital, but still show facets for the other types, with counts, so that users know whether other types exist for the search term. Have a look at tagging and excluding filters.
Result grouping might give you what you want. Just group by that parameter and specify a sufficiently large number of top documents for each group.
But I would test whether its performance is any better than two queries, since the grouping documentation calls out performance in its limitations section.
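A sketch of the grouped request with SolrJ (parameter values are illustrative):

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("camera");
q.set("group", "true");
q.set("group.field", "type"); // one group per type value, e.g. digital vs. the rest
q.set("group.limit", "10");   // top documents returned within each group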
Edit
Can Solr do fuzzy field collapsing? I.e., collapsing fields that have similar values, rather than identical ones?
I'd assumed that it could, but now I'm not sure, which makes my original question below invalid.
Original Question
For a large given set of values I need to decide which is the most prevalent. The set of all values will change over time, and so I can expect that the output may change over time too.
I gather Solr can do "field collapsing" to group results by a given field, with a tolerance of similarity. Would it be possible, nay even appropriate, to use Solr solely to collapse fields, to derive the most common value? We use Solr in other parts of the business, and it would be good to leverage existing code rather than home-brewing a custom solution.
No, Solr does not support fuzzy collapsing (at least not based on what is documented on the wiki).
Solr 4.0 supports group.func, which allows you to group results based on the result of a FunctionQuery, so it's possible that at some point a function could be created to get you approximately what you want, but none of the existing functions will.
However, Solr does support result clustering, which may work for your use case. Clustering is done with Carrot2. If you limit the fields used by Carrot2 to a single field, you may get a similar result to "fuzzy clustering", but you have far less control over what Carrot2 does than you do with field collapsing.
For a normal document you might want all your fields analyzed by carrot, e.g.:
carrot.title=my_title&carrot.snippet=my_title,my_description
But if you have, for example, a manufacturer field with slight variations of spelling or punctuation, it might work to only give carrot a single field for both title and snippet:
carrot.title=manufacturer&carrot.snippet=manufacturer
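To illustrate, a request sketch with SolrJ (this assumes the clustering component is already enabled and wired into your request handler in solrconfig.xml):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
SolrQuery q = new SolrQuery("*:*");
q.set("clustering", "true");
q.set("clustering.results", "true");   // cluster the search results
q.set("carrot.title", "manufacturer"); // feed Carrot2 just the one field
q.set("carrot.snippet", "manufacturer");
q.setRows(100); // Carrot2 clusters only the rows actually returned
QueryResponse rsp = solr.query(q);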