How to implement a complex token-matching algorithm in SOLR

Problem Description
I'm trying to implement a custom algorithm to match user-provided free-text input, a company name such as "Ford Motor", against a reference data source consisting of 1.4 million company names.
The algorithm executes the following steps:
Step 1) Performs an "Exact Match", followed by "Begins Match" and finally "Contains Match" of user provided search input. Results from this step are also sorted in the same order.
Step 2) Performs a token by token match of search input with reference company name.
Every token is matched in the following order: Exact, Begins, Contains, Levenshtein Distance (< 0.2) and Refined Soundex.
E.g. if the user input is "Foord Motur Holding" and it's being matched against "The Ford Motor Holdings Company", then the first token "Foord" will match "Ford" via the Soundex match, the second token "Motur" will match "Motor" via the edit distance algorithm, and the last token "Holding" will match "Holdings" via the Begins match.
Scoring:
Every token match is first scored on a scale that rates the matching technique, with Exact match being the best and Soundex being the worst.
The overall score is calculated, on a scale of 0-100%, by calculating a weighted average of individual token-match scores. Weights are assigned based on index-order of token i.e. the first token has highest weight and last token has lowest.
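The scoring scheme described above can be sketched in Python roughly as follows. The numeric value per technique and the linearly decreasing position weights are assumptions made for illustration; the description only fixes their ordering, not the actual values.

```python
# Hypothetical score per matching technique.
# Only the ordering (Exact best, Soundex worst) comes from the description.
TECHNIQUE_SCORE = {
    "exact": 1.0,
    "begins": 0.8,
    "contains": 0.6,
    "levenshtein": 0.4,
    "soundex": 0.2,
    "none": 0.0,
}


def overall_score(techniques):
    """Weighted average of per-token scores, as a percentage (0-100).

    `techniques` lists, in input order, how each search token matched.
    """
    n = len(techniques)
    if n == 0:
        return 0.0
    # Assumed weighting: the first token gets weight n, the last gets weight 1.
    weights = [n - i for i in range(n)]
    weighted = sum(w * TECHNIQUE_SCORE[t] for w, t in zip(weights, techniques))
    return 100.0 * weighted / sum(weights)


# "Foord Motur Holding" vs "The Ford Motor Holdings Company" from the example.
example = overall_score(["soundex", "levenshtein", "begins"])
```

With these placeholder values, a match on every token by Exact gives 100%, and moving a weak match to an earlier token lowers the result more than moving it to a later one.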
My Partial Solution
I have implemented a simple schema in Solr to store the reference company names: a String field (called companyName), a simple text field (called companyText) copied from the string field, and another text field (called companySoundex), also copied from the string field, that uses PhoneticFilterFactory for Refined Soundex based matching.
I have been able to replicate step 1) in a single solr query.
For step 2) I plan to fire 3 parallel queries to solr server. First query performing a simple text search on companyText field, second query performing fuzzy match using ~ operator on companyText field and third query performing soundex match on companySoundex field. I plan to somehow combine the results from these 3 parallel queries to get desired final result.
Questions:
1) Is there a better way to replicate Step 2) of original algorithm?
2) Even if I go with my "three-parallel-queries" approach, how do I get the "right" sorting order that I get in the original algorithm?
I guess the main problem is how to compare the Solr scores from these 3 entirely different queries when combining the results.
Thanks for reading this long question. Any help/pointers would be greatly appreciated.

Look at the DisMax query parser. http://wiki.apache.org/solr/DisMaxRequestHandler
For each separate query, you'll actually build up separate fields in the index for matching. Then use DisMax to combine the queries in a weighted fashion.
I suggest giving up on your 3 parallel queries approach now. Last time I looked into this it was impossible to relate scores from 2 separate queries. It just doesn't work. If you want a single set of results sorted by score, you have to figure out how to do this in a single query.
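As a rough illustration of the single-query approach, the request parameters could be assembled like this in Python. The field names come from the schema in the question; the boost values are placeholders I made up, and the request is only built here, not sent.

```python
from urllib.parse import urlencode


def build_dismax_params(user_input, rows=10):
    """Build parameters for one DisMax query over all three fields."""
    return {
        "defType": "dismax",
        "q": user_input,
        # Assumed boosts: the less analysis a field has, the higher its weight.
        "qf": "companyName^10 companyText^5 companySoundex^1",
        "fl": "companyName,score",
        "rows": rows,
    }


params = build_dismax_params("Foord Motur Holding")
query_string = urlencode(params)
```

Because everything is scored inside one query, the result list comes back with a single comparable score per document, which is what the three parallel queries cannot give you.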

IMHO, this functionality cannot be achieved with the out-of-the-box handlers that Solr provides. You would be better off writing a custom query handler that matches and scores the results in this manner.

Related

Indexing and searching words and word-parts

I just indexed a bunch of text data from our products DB. My goal is evaluating Apache Solr for production use.
This is a document example:
{
"shape":"Geometric",
"color":"MATTE BLACK",
"gender":"unisex",
"model":"CLUBMASTER RX 5154",
"sales":10,
"lens":"rugged",
"material":"plastic",
"brand":"Ray-Ban"
}
The most important thing in our search app is fuzzy matching, because inaccurate search terms are very frequent.
So, I'm a little disappointed with results found by Solr.
For example:
clubmaster -> many results
club master -> no results
Why?!
ray ban -> many results
rayban -> no results
I also tried putting ~1 or even ~2 after my term, with no luck!
All fields are indexed '*_txt_en' predefined field.
You can't run a serious production setup without customizing the schema and solrconfig to fit your specific needs. From what I can guess, you would get the results you want by:
copying your text fields into different versions with different analysis, for example:
one as a string type, which is hard to match
one field that uses EdgeNGram to match prefixes
another with WordDelimiterFilterFactory to match ray-ban/rayban
...
using edismax as the query parser
tuning edismax, which has many things to tweak. The most important is to search on all the fields above but weight them differently: the less analysis, the more weight
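To make the effect of those analysis chains more concrete, here is a small Python imitation of what an edge n-gram field and a word-delimiter field might end up indexing. This is a simplified sketch, not the actual Solr filter implementations, and the function names and gram sizes are made up for the example.

```python
import re


def edge_ngrams(token, min_gram=2, max_gram=10):
    """Prefixes of `token` between min_gram and max_gram characters long."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def delimiter_variants(text):
    """Split on non-alphanumeric characters and also emit the joined form."""
    parts = [p.lower() for p in re.split(r"[^A-Za-z0-9]+", text) if p]
    variants = list(parts)
    if len(parts) > 1:
        # The concatenated form is what lets "rayban" find "Ray-Ban".
        variants.append("".join(parts))
    return variants


prefixes = edge_ngrams("club")
brand_terms = delimiter_variants("Ray-Ban")
```

Here `brand_terms` contains "ray", "ban" and "rayban", so both spellings of the brand have an indexed term to hit.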

how to get resultcounts for each word if multiword-search was without results

On our webshop I want to implement a feature which should do the following:
If a user e.g. searches for "phone magnum", there will be no results.
If there were no results I want to give him the possibility to see
that search for "phone" will give him 139 results
and search for "magnum" will get 12 results.
I don't want to start several queries only to get those counts, but at the moment I have no idea how to do that.
I read the Solr-wiki for faceting, but didn't find anything useful for my problem. Maybe I missed something ....
Not sure why you want to avoid multiple queries. If your first search on the phrase "phone magnum" does not return any results, you could issue one query per search keyword with rows=0 which will give you only the counts. This should be efficient, since you are not building any result documents and only getting the result counts.
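A minimal sketch of those follow-up queries in Python, assuming the searchable field is called name (as in the facet example in this answer) and that JSON output is wanted. It only builds the query strings; sending them and reading numFound from each response is left out.

```python
from urllib.parse import urlencode


def count_only_queries(phrase, field="name"):
    """One rows=0 query string per keyword of the failed search phrase."""
    return [
        urlencode({"q": f"{field}:{word}", "rows": 0, "wt": "json"})
        for word in phrase.split()
    ]


queries = count_only_queries("phone magnum")
```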
However, if you really want to avoid the subsequent queries, here is one approach: have a field in your index which does not take IDF into account. (See this on how to do that.) Once that field is available (call it, say, name_no_idf), issue a query against it: name_no_idf:(phone magnum). Notice that this is not a phrase search.
The documents which contain both phone and magnum in the name_no_idf field will get a score of 2, while the docs matching only one word will get a score of 1. To this query you add facet=true&facet.field=name. The facet counts you get for these two words will then be the counts you are looking for. A few warnings, though:
if one of the words is very infrequent, you may need to increase facet.limit
facet queries are expensive

SOLR index time boost depending on the field value

Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important because the hardware I run Solr on is not very powerful, and boosting at query time fails with an OutOfMemory exception. (Even if I could work around that by increasing the memory for Java, I prefer to be on the safe side and build the index in the most efficient way possible.)
Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.
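For example, the code that generates the XML might look something like this Python sketch. The 1/sqrt(length) formula and the id field are assumptions picked for illustration; the text field name is taken from the query in the question. Also note that this relies on a Solr version that still accepts the boost attribute on documents.

```python
import math
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, text):
    """Build an <add> message whose doc boost shrinks as the text gets longer."""
    # Assumed formula: shorter text gives a larger boost.
    boost = 1.0 / math.sqrt(max(len(text), 1))
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc", {"boost": f"{boost:.3f}"})
    ET.SubElement(doc, "field", {"name": "id"}).text = doc_id
    ET.SubElement(doc, "field", {"name": "text"}).text = text
    return ET.tostring(add, encoding="unicode")


short_doc = build_add_xml("2", "ABCD")
```

With this formula the four-character item ABCD gets a boost of 0.5, while the eight-character item ABCDEBCE gets roughly 0.354.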
You can check the DIH special commands, which include a $docBoost command:
$docBoost : Boost the current doc. The value can be a number or the toString of a number
However, there seems to be no $fieldBoost command.
For your case, though, if you are using DefaultSimilarity, shorter fields are already boosted higher than longer fields in the score calculation.
You can also implement your own Similarity class with the TF (term frequency) and LengthNorm calculations changed to suit your needs.

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure if I understand properly what you want to achieve but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you would want both parts to be boosted by the same scale (I am not sure if this isn't the default behaviour of Solr) you would want to use dismax or edismax and your query would change to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be an OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has an effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
Another option would be to make use of pseudo-field calculated using the query() function to report how any set of results score on some other query/queries. So if for example the original document set was the top-N from query_A, and then those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting-field …&fl=bscore:query({!dismax v="query_B"})&…. Then the document's scores against query_B would be included in the output (as bscore).
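A sketch of how such a request could be put together in Python. It passes query_B through a separate request parameter (named qb here, a made-up name) and dereferences it with $qb instead of embedding it in the fl value; the id field is also an assumption about the schema.

```python
def score_against_other(query_a, query_b, rows=10):
    """Parameters that rank by query_a and report each doc's query_b score."""
    return {
        "q": query_a,
        "qb": query_b,  # hypothetical parameter name, referenced below as $qb
        "fl": "id,score,bscore:query($qb)",
        "rows": rows,
    }


params = score_against_other("ford motor", "holdings company")
```

Each returned document would then carry both score (against query_A) and bscore (against query_B).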
Finally, the result-grouping functionality can be used to collect both the top-N for one query and the scores for lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)

how can I limit by score before sorting in a solr query

I am searching "product documents". In other words, my solr documents are product records. I want to get say the top 50 matching products for a query. Then I want to be able to sort the top 50 scoring documents by name or price. I'm not seeing much on how to do this, since sorting by score, then by name or price won't really help, since scores are floats.
I wouldn't mind if I could do something like map the scores to ranges (like a score of 8.0-8.99 would go in the 8 bucket score), then sort by range, then by names, but since there is basically no normalization to scoring, this would still make things a bit harder.
Tl;dr How do I exclude low scoring documents from the solr result set before sorting?
You can use frange to achieve this, as long as you don't want to sort on score (in which case I guess you could just do the filtering on the client side).
Your query would be something along the lines of:
q={!frange l=5}query($qq)&qq=[awesome product]&sort=price asc
Set the l argument in the q-frange-parameter to the lower bound you want to filter score on, and replace the qq parameter with your user query.
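If you build the request in code, a small helper along these lines could fill in the two values. This is a Python sketch only; the function name is made up and the threshold of 5 is just the value from the example above.

```python
def frange_params(user_query, min_score, sort="price asc"):
    """Keep only docs scoring at least min_score, then sort by another field."""
    return {
        "q": f"{{!frange l={min_score}}}query($qq)",
        "qq": user_query,
        "sort": sort,
    }


params = frange_params("awesome product", 5)
```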
As observed by Karl Johansson, you could do the filtering on the client side: load the first 50 rows of the response (sorted by score desc) and then manipulate them in JS for example.
The jQuery DataTables plugin works fantastically for that kind of thing: sorting, sorting on multiple columns, dynamic filtering, etc. -- and with only 50 rows it would be very fast too, so that users can "play" with the sorting and filtering until they find what they want.
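The same client-side idea, sketched in Python instead of JS: keep the 50 best-scoring rows, then re-sort only those. The score and price keys are assumptions about what the response documents contain.

```python
def top_n_then_sort(docs, n=50, sort_key="price"):
    """Take the n highest-scoring docs, then order them by sort_key."""
    top = sorted(docs, key=lambda d: d["score"], reverse=True)[:n]
    return sorted(top, key=lambda d: d[sort_key])


docs = [
    {"name": "A", "score": 9.1, "price": 30},
    {"name": "B", "score": 0.4, "price": 5},
    {"name": "C", "score": 8.7, "price": 12},
]
best_two_by_price = top_n_then_sort(docs, n=2)
```

Document B is the cheapest but scores too low to make the cut, so it never shows up in the re-sorted list.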
I don't think you can simply "exclude low scoring documents from the solr result set before sorting", because the relevance score is only meaningful for a given combination of search query and resulting document list. That is, scores are only meaningful within a given search, and you cannot set one threshold for all searches.
If you were using Java (or PHP) you could get the top 50 documents and then re-sort this list in your programming language but I don't think you can do it with just SOLR.
Anyway, I would recommend you don't go down this route of re-sorting the results from SOLR, as it will simply confuse the user. People expect search results to be like Google (and most other search engines), where results come back in some form of TFIDF ranking.
Having said that, you could use some other criteria to separate documents with the same relevance scores by adding an index-time boost factor based on a price range scale.
I'd suggest you use SOLR to its strengths and use facets. Provide a price range facet on the left (like Ebay, Amazon, et al.) and/or a product category facet, etc. Also provide a "sort" widget to allow the results to be sorted by product name, if the user wants it.
[EDIT] this question might also be useful:
Digg-like search result ranking with Lucene / Solr?

Resources