Solr statistical information - solr

Is that possible to get some kind of stats from solr. E.g. Most frequently used words (unigrams), or phrases (bi- trigrams)?

Take a look at the schema browser (e.g. http://localhost:8983/solr/admin/schema.jsp), it gives you the top terms for any given field. You can also access this information with the LukeRequestHandler (e.g. http://localhost:8983/solr/admin/luke).
The TermsComponent also gives you information about indexed terms in a field and the number of documents that match each term.
The StatsComponent gives you statististics about numeric fields.

Related

Indexing and searching words and word-parts

I just indexed a bunch of text data from our products DB. My goal is evaluating Apache Solr for production use.
This is a document example:
{
"shape":"Geometric",
"color":"MATTE BLACK",
"gender":"unisex",
"model":"CLUBMASTER RX 5154",
"sales":10,
"lens":"rugged",
"material":"plastic",
"brand":"Ray-Ban"
}
The most important thing in our search app is fuzzy matching, because inaccurate search terms are very frequent.
So, I'm a little disappointed with results found by Solr.
For example:
clubmaster -> many results
club master -> no results
Why?!
ray ban -> many results
rayban -> no results
I also tried putting ~1 or even ~2 after my term, with no luck!
All fields are indexed '*_txt_en' predefined field.
You can't just run a serious production setup without customizing schema/solrconfig to fit your specific needs. From what I can guess, you would get the results you want by:
copy your text fields into different versions with different analysis, for example:
one as a string type, hard to match
one field that is using EdgeNgram to match prefixes.
another with WordDelimiterFilterFactory to match ray-ban/rayban
...
using edismax as the query parser
in edismax, there are many things to tweak in it. But the most important is: search on all the fields above, but weight then in different way, the less analysis, the more weight

How to perform an exact search in Solr

I implementing Solr search using an API. When I call it using the parameters as, "Chillout Lounge", it returns me the collection which are same/similar to the string "Chillout Lounge".
But when I search for "Chillout Lounge Box", it returns me results which don't have any of these three words.(in the DB there are values which have these 3 values, but they are not returned.)
According to me, Solr uses Fuzzy search, but when it is done it should return me some values, which will have at least one these value.
Or what could be the possible changes I should to my schema.XML, such that is would give me proper values.
First of all - "Fuzzy search" is a feature you'll have to ask for (by using ~ in standard Lucene query syntax).
If you're talking about regular searches, you can use q.op to select which operator to use. q.op=AND will make sure that all the terms match, while q.op=OR will make any document that contain at least one of the terms be returned. As long as you aren't using fq for this, the documents that match more terms should be scored higher (as the score will add up across multiple terms), and thus, be shown higher in the result set.
You can use the debug query feature in the web interface to see scores for each term for a document, and find out why the document was returned at all. If the document doesn't match any terms, it shouldn't be returned, unless you're asking for all documents to be returned.
Be aware that the analyzer chain defined for the field you're searching might affect what's considered a match and not.
You'll have to add a proper example to get a more detailed answer.

Solr Custom Boosting if a specific field matches the query

We are trying to implement a very interesting search logic with custom boosting and I am wondering if Solr can support this.
We have the following fields in our index:
Name
Description
Keywords (array)
Each keyword will have an amount(int value) paired to it.
A search is run across Name, description and keywords field. If a keyword matches the search text, the corresponding index must be boosted based on the amount of the matching keyword only.
I've read through Solr DisMax and they can only boost a field using a fixed amount.
My scenario will be to boost the result by X amount based on matching keywords only.
Thanks in advance
The only viable solution i see to this problem (assuming ofcourse you DO NOT know the number of keywords in advance) would be to just make the query as a filter query (to skip the scoring stage), get all documents matching ( a bit problematic), then just sort them on your side using the matched term to build the a java Comparator.
Problems may arise when you get a particularly large number of documents, but you could probably side step this issue by pagination
If you don't have too much different amounts maybe you can try this on index-time:
Store "keywords" in different fields(dynamicfields->boost-*) based on it's amount:
boost-1 = keyword1,keyword4,keyword6 <br/>
boost-10 = keyword2<br/>
boost-100 = keyword5
You can search across all your boost fields(edismax), boost every dynamicfield with his amount in your (e)dismax conf(boost-1^1,boost-10^10,boost-100^100).

How do I create a Solr query that returns results even if one field in my query has no matches?

Suppose I want to create a recommendation system to suggest people you should connect with based off of certain attributes that I know about you and attributes I have about other people that are stored in a Solr index. Is it possible to query the index with a list of attributes (along with boosts for each attribute) and have Solr return scored results even if some of my fields return no matches? The way that I understand that Solr works is that if one of your fields doesn't contain a match in any documents found in your index, you get zero results for the entire query (even if other fields in the query matched) - is that right? What I would hope is that I could query the index and get a list of results back in order of a score given based on how many (and which) fields matched to something, even if some fields have no matches, for example:
Say that there are 2 people documents stored in the index as follows (figuratively):
Person 1:
Industry: Manufacturing
City: Oakland
Person 2:
Industry: Manufacturing
City: San Jose
And say that I perform a pseudo-Solr query that basically says "Search for everyone whose industry is equal to manufacturing and whose city is equal to Oakland". What I would like is to receive both results back in the result set, even though one of the "Persons" does not reside in Oakland. I just want that person to come back as a result with a lower score than Person1. Is this possible? What might a solr query look like to handle this? Assume that I have many more than 2 attributes for each person (so saying that I can use "And" and "Or" in my solr query isn't really feasible.. or is it?) Thanks in advance for your helpful input! (PS I'm using Solr 3.6)
You mention using the AND operator, which is likely your problem.
The default behavior of Lucene, and Solr, query syntax is exactly what you are asking for. A query like:
industry:manufacturing city:oakland
Will match either, with scoring preference on those that match both. See the lucene query syntax documentation
You can use the bq parameter (boost query) does not affect matching, but affects the scores only.
http://localhost:8983/solr/persons/select?q=industry:manufacturing&bq=City:Oakland^2
play with the boosting factor at the end to get the correct balance between matching score, and boosting score.

How can I sort appengine search index results by relevance?

I'm working on a project that uses Google App Engine's text search API to allow users to search for documents that include a words field. I'm sorting using a MatchScorer, which according to the documentation "assigns a score based on term frequency in a document".
When a user enters a query like "business promo", I convert this into a query string that looks like words:business OR words:promo. I would have expected that this would return documents that contain both the words "business" and "promo" before documents that only contain one of the words (since the documentation says it assigns a score based on term frequency in the document). However, I frequently see results that contain only one of the words before documents that contain both.
I've also tried querying using the RescoringMatchScorer, but see the same problem using this scorer.
I've thought about doing separate queries - ones that AND the search terms and ones that OR the search terms - but this would require many queries if the user enters more than two search terms. For example, if I searched for "advanced business solutions", I'd need queries like this to cover all the bases:
words:advanced AND words:business AND words:solutions
words:advanced AND words:business
words:advanced AND words:solutions
words:business AND words:solutions
words:advanced OR words:business OR words:solutions
Does anyone have any hints on how to perform searches that return more relevant results (i.e. more search term matches) before less relevant results?
Perhaps it depends on how you interpret the phrase "term frequency". I think you're interpreting it to mean "how many of my search terms appear in the document". But it could also mean "how many times (any of) the search terms appears in each document", and indeed -- at least according to some simple experiments I've done -- the latter seems to be the actual behavior.
For example, a document that contains the word "business" 20 times and never mentions the word "promo" would be scored higher than a document that contains "business" and "promo" only once each. Does that jibe with the behavior you're seeing?

Resources