I am trying to use Carrot2 for result-set clustering and have a couple of questions about it.
a) Can we cluster the documents in Solr/Lucene based on specific Solr fields? For example, cluster them by name, person name and geo-distance location (lat, long) with specific field weights?
b) My use case for clustering is not really online; it is more of a batch use case. Given that, does the restriction of a maximum of 1K results still apply?
Carrot2 performs clustering based only on the natural text of your documents. Person names would probably be too short for meaningful clustering; Carrot2 is not suitable for geo-distance and other numerical data.
The 1k restriction / recommendation is based on the design goal of Carrot2: to cluster small collections of texts (such as search results) fast enough so that the process can be done on-line. Carrot2 does well for collections around 1k documents, but will not scale very well beyond several thousands of documents.
SOLR (Lucene) indices, like all inverted indices, use a term dictionary to assign an index to each term. Each field in the index generates its own term dictionary (which can be inspected in the SOLR admin tool).
I have a very large SOLR index, where each document has very many textual fields. All fields contain English text following a similar distribution.
In my case this is very wasteful: it maintains many very large term dictionaries (in memory) which are almost all the same... as the number of (different) terms in the documents grows these dictionaries grow very large.
I cannot combine all fields into a single search field because I need to run queries restricted over specific fields.
Is there a way to tell SOLR to use the same term dictionary for several fields?
(Afterthought: but perhaps if terms follow a Zipfian distribution, the amount of sharing between fields won't be significant anyway, as many terms will appear only once and hence in only one dictionary?)
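One way to check the afterthought before looking for a Solr-level fix is to measure how much the per-field vocabularies actually overlap on a sample of documents. The following is only a rough sketch: it assumes documents are available as plain dicts of field name to text, it uses naive whitespace tokenization instead of the real analyzer chain, and the function names are made up for illustration.

```python
def field_term_sets(docs):
    """Collect the set of distinct terms seen in each field.

    `docs` is an iterable of dicts mapping field name -> text.
    Tokenization here is a naive lowercase/whitespace split, which is
    an assumption and not what a Solr analyzer would do.
    """
    terms = {}
    for doc in docs:
        for field, text in doc.items():
            terms.setdefault(field, set()).update(text.lower().split())
    return terms


def sharing_ratio(terms):
    """Ratio of summed per-field dictionary sizes to one shared dictionary.

    A value near 1.0 means the fields share few terms; a value near the
    number of fields means the dictionaries are almost identical.
    """
    per_field_total = sum(len(t) for t in terms.values())
    shared = len(set().union(*terms.values()))
    return per_field_total / shared if shared else 1.0


sample = [
    {"title": "red fox", "body": "the red fox jumps"},
    {"title": "blue sky", "body": "the sky"},
]
ratio = sharing_ratio(field_term_sets(sample))
```

If the ratio on a realistic sample stays close to 1, sharing a dictionary would save little, which is what the Zipfian argument predicts for rare terms.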
I've been analyzing the best method to improve the performance of our SOLR index and will likely shard the current index to allow searches to become distributed.
However given that our index is over 400GB and contains about 700MM documents, reindexing the data seems burdensome. I've been toying with the idea of duplicating the indexes and deleting documents as a means to more efficiently create the sharded environment.
Unfortunately, it seems that a modulus isn't available to query against the document's internal numeric ID. What other possible partitioning strategies could I use to delete by query rather than do a full reindex?
A Lucene tool, IndexSplitter, would do the job; see it mentioned here, with a link to an article (in Japanese; translate it with Google...).
If you can find a logical key to partition the data, then it will be helpful in more than one way. For example, can you split these documents across shards based on some chronological order?
We have a similar situation. We have an index of 250M docs that are split across various shards based on their created date. A major use case involves searching across these shards based on a range of created dates, so the search is only submitted to the shards that contain docs within the given date range. There may be other benefits to logically partitioned data, for example different capacity planning, applying different qualities of service to search terms, etc.
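The routing step described above can be sketched roughly as follows. The shard names and date boundaries are invented for illustration; only the idea of sending a query to the shards whose date span overlaps the requested range comes from the answer.

```python
from datetime import date

# Hypothetical shard layout: (shard name, first day covered, last day covered).
SHARDS = [
    ("shard_2010", date(2010, 1, 1), date(2010, 12, 31)),
    ("shard_2011", date(2011, 1, 1), date(2011, 12, 31)),
    ("shard_2012", date(2012, 1, 1), date(2012, 12, 31)),
]


def shards_for_range(start, end, shards=SHARDS):
    """Return the names of the shards whose date span overlaps [start, end]."""
    return [name for name, first, last in shards
            if first <= end and start <= last]


targets = shards_for_range(date(2010, 11, 1), date(2011, 2, 1))
```

The resulting names could then be turned into whatever the search layer expects, for instance the host list passed in Solr's `shards` request parameter.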
I answered this in another StackOverflow question. There's a command-line utility I wrote (Hash-Based Index Splitter) to split a Lucene index based on each document's ID hash.
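The idea of assigning documents to shards by a hash of their ID can be illustrated with a few lines. This is not the utility mentioned above, just a minimal sketch of the technique; the CRC32 hash and the function name are assumptions chosen for the example.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document ID to a shard number in [0, num_shards).

    CRC32 is used because it is stable across processes and machines,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


doc_ids = [f"doc-{i}" for i in range(1000)]
assignment = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}
```

Because the mapping is deterministic, the same function can later be used at index time to route new documents to the shard that would have received them in the split.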
We're reimplementing a search that includes locations that need to be clustered on a map. I've been searching without luck for an implementation in SOLR.
The current search with map clustering implemented is at http://www.uship.com/find
Has anyone seen anything similar, or have ideas about how best to do this?
Regards,
Nick
If the requirement is to cluster a fairly small number of points, perhaps fewer than 1000, then Solr needn't be involved. Grab the points and plot them using something like HeatmapJS.
I presume the requirement is to cluster all results in a search, which may potentially be many thousands or even millions of documents. I suggest starting by generating a heatmap of the densities over a grid of the search area. You can do this by indexing each point encoded in geohash form at each length (e.g. D2RY, D2R, D2, D), with each value preceded by its length: 4_D2RY, 3_D2R, 2_D2, 1_D. These little strings go into a multi-valued "string" type field in Solr that you will then facet on.
When faceting, you'll come up with a suitable grid resolution (e.g. geohash prefix length) and then use that as a prefix query, like facet.prefix=4_. You can index the point using a LatLonType field separately and do a standard bounding box query there. At this point, your faceted search results will give you the information to fill in a grid of numbers.
The beauty of this scheme is that it is fast: you could generate such heatmaps on the fly. It will use a fair amount of RAM, though, since this is faceting on a multi-valued field that will have a ton of values. This is something I want to add to the new Lucene spatial module (or perhaps at the Solr layer) in a way that won't need extra memory and to make it easy. It won't make it to Solr 4.0, but maybe 4.1.
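The length-prefixed tokens from the scheme above are easy to generate at index time. Here is a small sketch that starts from an already computed geohash string; the function names and the `geohash_prefixes` field name are made up for illustration, while `facet`, `facet.field`, `facet.prefix` and `rows` are ordinary Solr request parameters.

```python
def geohash_prefix_tokens(geohash):
    """Return one length-prefixed token per geohash prefix, longest first.

    For "D2RY" this yields 4_D2RY, 3_D2R, 2_D2 and 1_D, which is the set
    of values to store in the multi-valued string field.
    """
    return [f"{n}_{geohash[:n]}" for n in range(len(geohash), 0, -1)]


def heatmap_facet_params(resolution, field="geohash_prefixes"):
    """Build Solr request parameters that facet on one grid resolution.

    `field` is a hypothetical field name; `resolution` is the geohash
    prefix length to use as the grid cell size.
    """
    return {
        "rows": "0",                        # only the facet counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.prefix": f"{resolution}_",   # e.g. "4_" for 4-character cells
    }


tokens = geohash_prefix_tokens("D2RY")
params = heatmap_facet_params(4)
```

Each facet value returned for the `4_` prefix then corresponds to one grid cell, and its count is the density for that cell.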
At this stage, perhaps a heatmap is fine as-is. But you may want to apply clustering on top of this, as your question states. Someone tipped me off to some interesting geo clustering algorithms that can be applied to heatmaps.
I don't know whether you have searched Lucidworks, but there are many interesting resources there:
Search with Polygons: Another Approach to Solr Geospatial Search
Go through these:
http://www.lucidimagination.com/search/?q=geospatial#%2Fn
Already implemented in Solr:
http://wiki.apache.org/solr/SpatialSearch/ (what's wrong with this approach?)
http://wiki.apache.org/solr/SpatialSearchDev
https://issues.apache.org/jira/browse/SOLR-3304
Is it better to use a lot of indexes in Lucene (e.g. one for every user, as your application allows that) or just one index holding every document, if you think about:
performance
disk space
health
I am using elasticsearch, therefore I am using Lucene.
In Elasticsearch, based on your information, I think I would use one index. My understanding is that users are only searching their own documents, and the documents seem to be relatively similar.
Performance - When searching you can use a Filtered Query to filter to only the documents matching the user. The user id filter is cache-able, and fast.
Scalable - In Elasticsearch, you control sharding and replication at the index level. Elasticsearch can handle large numbers of indexes; I just think configuring appropriate shards and replicas could be valuable for the entire index.
In a single index, you can still easily wipe away data (see delete by query), and there should be little concern about seeing others' data unless you write your queries wrong. A filtered query that restricts results to only those associated with a user id is very simple, similar in complexity to searching a different index per user.
Your exact needs might fit a different approach better. Based on what I have so far, though, I would choose one index.
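For illustration, a user-scoped search body might look roughly like the sketch below. The field names `user_id` and `body` are assumptions, and the example uses a `bool` query with a `filter` clause, which is how later Elasticsearch versions express what the answer calls a filtered query (the older `filtered` query type was deprecated later on).

```python
import json


def user_scoped_query(user_id, text):
    """Build a search body that matches `text` but only within one user's documents.

    The term filter on the (hypothetical) `user_id` field does not affect
    scoring and is the part Elasticsearch can cache.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": text}}],
                "filter": [{"term": {"user_id": user_id}}],
            }
        }
    }


body = user_scoped_query("user-42", "quarterly report")
payload = json.dumps(body)  # what would be sent as the request body
```

Wrapping query construction in one helper like this is also a simple guard against the "wrote the query wrong" risk, since every search goes through the same user filter.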
We have many years of weather data that we need to build a reporting app on. Weather data has many fields of different types, e.g. city, state, country, zipcode, latitude, longitude, temperature (hi/lo), temperature (avg), precipitation, wind speed, date, etc.
Our reports require that we choose combinations of these fields then sort, search and filter on them e.g.
WeatherData.all().filter('avg_temp =',20).filter('city','palo alto').filter('hi_temp',30).order('date').fetch(100)
or
WeatherData.all().filter('lo_temp =',20).filter('city','palo alto').filter('hi_temp',30).order('date').fetch(100)
It may be easy to see that these queries require different indexes. It may also be obvious that the 200 index limit can be crossed very easily with any such data model, where a combination of fields is used to filter, sort and search entities. Finally, the number of entities in such a data model can obviously run into the millions, considering that there are many cities and that we could store hourly data instead of daily.
Can anyone recommend a way to model this data which allows for all the queries to still be run, at the same time staying well under the 200 index limit? The write-cost in this model is not as big a deal but we need super fast reads.
Your best option is to rely on the built-in support for merge join queries, which can satisfy these queries without an index per combination. All you need to do is define one index per field you want to filter on and sort order (if that's always date, then you're down to one index per field). See this part of the docs for details.
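To see why one index per field is enough, it helps to picture each single-property index as a list of entries for one equality filter, already sorted by date. A merge join then walks those lists together and keeps only the entries present in all of them. The sketch below is a simplified in-memory illustration of that idea, not the datastore's actual implementation; the data and function name are invented.

```python
from bisect import bisect_left


def zigzag_merge(indexes, limit=None):
    """Intersect several sorted lists of (date, key) entries.

    Each list stands for one per-field index restricted to one equality
    filter (e.g. city == 'palo alto'), sorted by date. The result keeps
    that date order, so no combined index is needed for the sort.
    """
    if not indexes or any(not idx for idx in indexes):
        return []
    positions = [0] * len(indexes)
    results = []
    candidate = indexes[0][0]
    while True:
        advanced = False
        for i, idx in enumerate(indexes):
            # Skip ahead to the first entry that is >= the current candidate.
            pos = bisect_left(idx, candidate, positions[i])
            if pos == len(idx):
                return results  # one index is exhausted: no more matches
            positions[i] = pos
            if idx[pos] != candidate:
                candidate = idx[pos]  # this index forces a later candidate
                advanced = True
        if not advanced:
            # Every index contains the candidate, so it satisfies all filters.
            results.append(candidate)
            if limit is not None and len(results) >= limit:
                return results
            positions[0] += 1
            if positions[0] == len(indexes[0]):
                return results
            candidate = indexes[0][positions[0]]


city_idx = [("2020-01-01", "a"), ("2020-01-02", "b"), ("2020-01-03", "c")]
avg_idx = [("2020-01-02", "b"), ("2020-01-03", "c"), ("2020-01-04", "d")]
hi_idx = [("2020-01-01", "a"), ("2020-01-03", "c")]
matches = zigzag_merge([city_idx, avg_idx, hi_idx], limit=100)
```

Adding another equality filter only adds one more list to the merge, which is why the index count grows with the number of fields and not with the number of field combinations.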
I know it seems counter-intuitive but you can use a full-text search system that supports categories (properties/whatever) to do something like this as long as you are primarily using equality filters. There are ways to get inequality filters to work but they are often limited. The faceting features can be useful too.
The upcoming Google Search API
IndexTank is the service I currently use
EDIT:
Yup, this is totally a hackish solution. The documents I am using it for are already in my search index and I am almost always also filtering on search terms.