Solr as an analytics reporting platform

I am trying to use Solr to build a reporting platform. It helps merge data from many different sources, and its rich query API is then used to create the reports. There are some analytical requirements too. Most cases are covered, except for the following two scenarios:
Pivoting + range facets: Example use case: we want the number of US citizens and non-US citizens who visited Europe each year. The collection has a flag USCitizen and a date visitedOn.
Faceting on a function query value: I have createdDate and closedDate, and I calculate the number of days between them using functions. I would like to facet on that calculated number of days.
Any help with these two cases would be really useful in determining whether Solr suits the need.
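As a rough sketch of what the two requests might look like: the first function builds a JSON Facet API body (terms facet on USCitizen with a nested yearly range facet on visitedOn), and the second builds one facet.query per bucket using frange over ms(closedDate,createdDate). This assumes a Solr version that ships the JSON Facet API; the date bounds, bucket edges, and function names are made up for illustration, and nothing is sent to a server here.

```python
import json

MS_PER_DAY = 24 * 60 * 60 * 1000


def pivot_range_facet(start, end, gap="+1YEAR"):
    """Build a JSON request body: citizenship buckets, each split by year."""
    return {
        "query": "*:*",
        "limit": 0,  # only the facet counts are wanted, not documents
        "facet": {
            "by_citizenship": {
                "type": "terms",
                "field": "USCitizen",
                "facet": {
                    "by_year": {
                        "type": "range",
                        "field": "visitedOn",
                        "start": start,
                        "end": end,
                        "gap": gap,
                    }
                },
            }
        },
    }


def days_open_facet_queries(day_buckets):
    """Build one facet.query string per (low_days, high_days) bucket.

    ms(closedDate,createdDate) is the difference in milliseconds, so the
    bucket edges are converted from days to milliseconds.
    """
    return [
        "{!frange l=%d u=%d incu=false}ms(closedDate,createdDate)"
        % (low * MS_PER_DAY, high * MS_PER_DAY)
        for low, high in day_buckets
    ]


if __name__ == "__main__":
    # Hypothetical date window and day buckets.
    body = pivot_range_facet("2010-01-01T00:00:00Z", "2020-01-01T00:00:00Z")
    print(json.dumps(body, indent=2))
    for fq in days_open_facet_queries([(0, 7), (7, 30), (30, 90)]):
        print("facet.query=" + fq)
```

Each facet.query comes back with its own count, so the buckets have to be chosen up front rather than discovered by Solr.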

Related

Apply Solr filter query to only part of the search results

I have a working Solr solution that requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a filter query of, say, "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there's a way to run a single query that applies the first filter to get the first set of topDocs and then generates a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting re-sorts and mixes the two sets).
I experimented with partitioning the data into two groups by marking a specific field during indexing and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the performance of the two-query solution, or investigating a kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam-dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternative approaches:
Instead of filtering the results, use a variable, higher boost so that all the results for type:digital come out on top and the rest of the documents follow. There is no need for separate queries. The boost can be changed according to the type value.
The other approach is not to display the results for types other than digital. However, you can display facets for the other types, with their counts, so that users know whether other types exist for the search term. You can look into tagging and excluding filters.
Result grouping might give you what you want. Just group by that parameter and specify a sufficiently large number of top documents in each group.
But I would test whether its performance is any better than two queries, because the documentation mentions performance in its limitations section.
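The boost and tag/exclude suggestions above could be expressed as request parameters roughly like this. It is only a sketch: the dismax parser, the qf field name "text", the boost value of 10, and the tag name "typeTag" are assumptions for illustration, while the type field and the camera query come from the question.

```python
from urllib.parse import urlencode


def boosted_query_params(term, preferred_type="digital", boost=10):
    """Single query: every match is returned, preferred type ranked higher."""
    return [
        ("defType", "dismax"),
        ("q", term),
        ("qf", "text"),  # hypothetical searchable field
        ("bq", "type:%s^%d" % (preferred_type, boost)),
    ]


def tagged_filter_params(term, selected_type="digital"):
    """Filter on one type, but still facet over all types.

    The filter is tagged, and the facet excludes that tag, so the facet
    counts show how many matches exist for the other types too.
    """
    return [
        ("q", term),
        ("fq", "{!tag=typeTag}type:%s" % selected_type),
        ("facet", "true"),
        ("facet.field", "{!ex=typeTag}type"),
    ]


if __name__ == "__main__":
    print(urlencode(boosted_query_params("camera")))
    print(urlencode(tagged_filter_params("camera")))
```

The boost variant changes ranking only, so non-digital documents still appear, just lower down; the tag/exclude variant keeps the filter and exposes the other types as counts.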

CakePHP search by relevance

I am developing a job site where I want to search through job ads by relevance. I have fields such as job title and job_text, for example. Now let's say a person searches for cakephp: I would like to get results for cakephp first, and after them results for php, which also matches, but cakephp is obviously the most relevant. How can I do this?
My suggestion is that you should run multiple queries for the sorting purposes.
For example, first find the jobs where the title contains, say, 'php', using order by title desc; then run a query to find jobs where 'php' appears in the job's keywords; and lastly run a query to find jobs where the description has the word 'php' in it.
Then you can combine the results for these queries.
The best way I found to do what I was trying to do at that stage was to integrate with Apache Solr or a similar search engine.
If you want to sort by relevance you will have to come up with some criteria of how relevance is defined for you and calculate it. For example if a certain article got more views it might be more relevant than another article because it was seen by more people. Combine that number with a few other variables (average rating for example if there is a rating functionality), calculate a relevance value based on them, store it in your table and order by the relevance value field. Update it every time one of the vars for the calculation changes or do it via a cron job one time per day, it all depends on your requirements and performance.
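To make the idea of a calculated relevance value concrete, here is a minimal sketch that scores each job ad in application code. The field weights and term weights are arbitrary assumptions (a title hit counts more than a body hit, and the exact term cakephp counts more than the broader php); it is not how CakePHP or Solr compute relevance.

```python
def relevance(job, term_weights, field_weights=None):
    """Sum of field_weight * term_weight for every whole-word match."""
    field_weights = field_weights or {"title": 3.0, "job_text": 1.0}
    score = 0.0
    for field, field_weight in field_weights.items():
        words = job.get(field, "").lower().split()
        for term, term_weight in term_weights.items():
            score += field_weight * term_weight * words.count(term.lower())
    return score


def search(jobs, term_weights):
    """Return matching jobs, highest relevance first."""
    scored = [(relevance(job, term_weights), job) for job in jobs]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [job for _, job in hits]


if __name__ == "__main__":
    jobs = [
        {"title": "PHP developer", "job_text": "php and mysql"},
        {"title": "CakePHP developer", "job_text": "cakephp php"},
        {"title": "Gardener", "job_text": "plants"},
    ]
    for job in search(jobs, {"cakephp": 2.0, "php": 1.0}):
        print(job["title"])
```

The same score could be precomputed and stored in a column, as the answer suggests, if the inputs change rarely.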

Faceted search: ElasticSearch/Solr or a simple database query?

Forgive this super basic question, from a search newbie.
I want to implement a site that makes use of faceted search. For example, it's a site with a database of hotels, and I want to allow users to search for hotels within a price range, with a swimming pool, with either three or four stars.
Clearly I can return results to users with a simple database query.
Should I use ElasticSearch or Solr to implement this instead of using a database query? If so, why?
Yes you should use ES or Solr. Reasons: primarily performance and the ability to change (think config) 'types of faceting' easily.
Faceting is no small feat, and although you could do it with an RDBMS, doing it fast requires hard thinking. Why do it yourself when you can use the gazillions of hours the Solr / ES (+ Lucene) teams have spent optimizing it?
As for the 'types of faceting' I mentioned:
Perhaps you want to do hierarchical faceting: select a price category, then display smaller price categories. How are they bucketed: fixed range, evenly distributed, etc.? Solr / ES provide these options from within a config.
Perhaps instead you implement price faceting with a slider with min/max handles. Do you want to display the number of hotels while you slide (histogram / facet stats in Solr / ES)?
While you've faceted on price, perhaps you still want to know the min and max values of the price slider as if you didn't filter on price. This is needed if you want to be able to draw the slider handles proportionally. (See my question on SO as part of considering a switch from Solr to ES: Elasticsearch: excluding filters while faceting possible? (like in Solr))
Faceting on stars? Perhaps you want to show the best price per stars facet if the user were to select that star rating (again histogram / stats).
Seriously, don't even consider doing the above with an RDBMS. You'll go insane.
Hope that helps, and yes I'm familiar with the domain :)
Additional questions, just ask.
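To illustrate what the answer means by computing facet values "as if you didn't filter" on that facet, here is a small in-memory sketch. It only shows the result a search engine would return, not how Solr or ES compute it; the hotel records and helper names are invented for the example.

```python
from collections import Counter


def matches(hotel, filters, skip=None):
    """True if the hotel passes every filter except the one named `skip`."""
    return all(
        predicate(hotel) for name, predicate in filters.items() if name != skip
    )


def facet_counts(hotels, filters, field):
    """Count values of `field`, ignoring any filter on that same field."""
    return Counter(h[field] for h in hotels if matches(h, filters, skip=field))


def price_bounds(hotels, filters):
    """Min and max price as if no price filter were applied (slider ends)."""
    prices = [h["price"] for h in hotels if matches(h, filters, skip="price")]
    return (min(prices), max(prices)) if prices else None


if __name__ == "__main__":
    hotels = [
        {"price": 80, "stars": 3, "pool": True},
        {"price": 150, "stars": 4, "pool": True},
        {"price": 300, "stars": 5, "pool": False},
        {"price": 120, "stars": 3, "pool": False},
    ]
    filters = {
        "price": lambda h: 100 <= h["price"] <= 200,
        "pool": lambda h: h["pool"],
    }
    print(facet_counts(hotels, filters, "stars"))
    print(price_bounds(hotels, filters))
```

This linear scan is fine for a handful of records; the point of the answer is that doing the same thing quickly over a large catalogue is what the search engines are built for.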

How/Why are these Solr Queries producing different results?

I'm using Apache Solr and querying an index whose schema has a text field PostBody, an integer Userid field, and a trie-based datetime field MostRecentActivityDate.
I'm attempting to apply query-time boosting to my select query so that more recent posts are boosted by some factor to assist in scoring. My values are chosen to give a timescale of days rather than years, as in many online date-boosting examples.
The following two queries produce different results; the only difference between them is where the "code" for the boosting is placed (i.e. before or after the field conditionals themselves). In my testing I've also noticed that both produce different results from when there is no {} boosting code, so it's not as if it is being ignored in one case.
Is anyone able to explain why they would produce different results? Thanks!
{!boost%20b=recip(ms(NOW,MostRecentActivityDate),1.16e-7,1,1)} (PostBody:"timmy is great and that is a fact") AND !Userid=2
Vs.
(PostBody:"timmy is great and that is a fact") AND !Userid=2 {!boost%20b=recip(ms(NOW,MostRecentActivityDate),1.16e-7,1,1)}
Since this will be very specific to your data, the best way to figure out what is happening is to turn on query debugging via the debugQuery=on parameter of your search. Here are two links that help explain the debug output.
Debugging Search Applications Relevance - Explanations
Why does id:archangel come before id:hawkgirl when querying for "wings"
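For reading the debug output, it can help to know what boost values the function in the question produces. Solr's recip(x,m,a,b) evaluates a / (m*x + b), and ms(NOW,MostRecentActivityDate) is the post's age in milliseconds. The sketch below reproduces that arithmetic with the constants from the question; the helper names are just for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def boost_for_age(age_days, m=1.16e-7, a=1.0, b=1.0):
    """Boost for a post whose last activity was `age_days` days ago."""
    return recip(age_days * MS_PER_DAY, m, a, b)


if __name__ == "__main__":
    for days in (0, 1, 7, 30):
        print(days, round(boost_for_age(days), 4))
```

With m = 1.16e-7, one day of age adds about 10 to the denominator, so a one-day-old post is already down to roughly 0.09 of the boost of a brand-new one.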

Indexes for google app engine data models

We have many years of weather data that we need to build a reporting app on. Weather data has many fields of different types, e.g. city, state, country, zipcode, latitude, longitude, temperature (hi/lo), temperature (avg), precipitation, wind speed, date, etc.
Our reports require that we choose combinations of these fields then sort, search and filter on them e.g.
WeatherData.all().filter('avg_temp =',20).filter('city','palo alto').filter('hi_temp',30).order('date').fetch(100)
or
WeatherData.all().filter('lo_temp =',20).filter('city','palo alto').filter('hi_temp',30).order('date').fetch(100)
It is easy to see that these queries require different indexes. It may also be obvious that the 200-index limit can be crossed very easily with any data model like this, where combinations of fields are used to filter, sort and search entities. Finally, the number of entities in such a data model can run into the millions, considering that there are many cities and we could store hourly data instead of daily.
Can anyone recommend a way to model this data which allows for all the queries to still be run, at the same time staying well under the 200 index limit? The write-cost in this model is not as big a deal but we need super fast reads.
Your best option is to rely on the built-in support for merge join queries, which can satisfy these queries without an index per combination. All you need to do is define one index per field you want to filter on and sort order (if that's always date, then you're down to one index per field). See this part of the docs for details.
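A quick back-of-the-envelope comparison shows why one index per field matters. This sketch uses a simplified counting model that is an assumption, not the datastore's exact rules: every non-empty combination of equality-filtered fields, always sorted by date, needs its own composite index in the first case, while the merge-join approach needs one (field, date) index per field.

```python
from itertools import combinations


def composite_index_count(filter_fields):
    """One index per non-empty combination of filter fields (plus the sort)."""
    n = len(filter_fields)
    return sum(
        1 for size in range(1, n + 1) for _ in combinations(filter_fields, size)
    )


def merge_join_index_count(filter_fields):
    """One (field, date) index per filterable field."""
    return len(filter_fields)


if __name__ == "__main__":
    fields = [
        "city", "state", "country", "zipcode",
        "hi_temp", "lo_temp", "avg_temp", "wind_speed",
    ]
    print(composite_index_count(fields), merge_join_index_count(fields))
```

With eight filterable fields the per-combination model already needs 255 indexes, past the 200 limit mentioned in the question, while the per-field model needs 8.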
I know it seems counter-intuitive but you can use a full-text search system that supports categories (properties/whatever) to do something like this as long as you are primarily using equality filters. There are ways to get inequality filters to work but they are often limited. The faceting features can be useful too.
The upcoming Google Search API
IndexTank is the service I currently use
EDIT:
Yup, this is totally a hackish solution. The documents I am using it for are already in my search index and I am almost always also filtering on search terms.
