How to configure Solr for name searches

I'm trying to create a searchable list of names (1M+ names) and need some loose matching for them. I've tried a few ways of configuring the server and querying the instance, but either ONLY exact matches are returned, or I get huge, inappropriate result sets that are missing the exact matches.
I'm new to Solr. Is there a good example of doing this, or at least a good starting point I can work from to achieve what I need?
Many thanks in advance

Have a look at fuzzy searches. They match terms within a given Levenshtein (edit) distance, which is what you refer to as "loose matching".
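
To make the edit-distance idea concrete, here is a minimal Python sketch. It computes the plain Levenshtein distance and builds a fuzzy query string using the `term~N` syntax. The field name `name` and the helper names are assumptions made up for illustration. As far as I know, Solr's fuzzy matching also counts a transposition of two adjacent characters as a single edit, which this plain version does not.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_query(field: str, term: str, max_edits: int = 1) -> str:
    """Build a fuzzy query string such as name:jonson~1.

    The field name is hypothetical; use whatever your schema defines.
    """
    if max_edits not in (0, 1, 2):
        raise ValueError("fuzzy queries accept an edit distance of 0, 1 or 2")
    return f"{field}:{term}~{max_edits}"


if __name__ == "__main__":
    print(levenshtein("johnson", "jonson"))
    print(fuzzy_query("name", "jonson", 1))
```

A smaller edit distance keeps the result set tighter, which may help with the "huge, inappropriate result sets" problem.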

If you want to search for similar-sounding names, you can also check PhoneticFilterFactory.
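
To show what a phonetic encoding does, here is a small Python sketch of American Soundex, one of the simpler encoders in this family. It is only an illustration of the idea (names that sound alike get the same code), not Solr's own implementation, and the function name is my own.

```python
# Map each consonant to its Soundex digit; vowels and h/w/y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        digit = CODES.get(letter, "")
        if digit and digit != previous:
            encoded += digit
        previous = digit  # a vowel resets this, so a repeated code counts again
    return (encoded + "000")[:4]  # pad with zeros, then cut to four characters


if __name__ == "__main__":
    for name in ("Smith", "Smyth", "Robert", "Rupert"):
        print(name, soundex(name))
```

If the same code is stored at index time and computed at query time, "Smith" and "Smyth" end up matching each other.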

Related

How to save value with wildcard in Solr?

I have the following problem with Solr. I need to implement a "reverse" search with wildcards: I want to store a value like "auto*", and that item should be found by requests like "autocar", "autoplan" or "automate". Could someone help me with this, please? Thanks.

If you want to match a shorter indexed value (auto) against a longer searched value (autobus), you want a custom analysis chain that includes EdgeNGramFilter on the query side only. The incoming search word then gets split into its possible prefixes, and each prefix is matched against the indexed term.
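
A rough Python sketch of what query-side edge n-grams do, assuming the value is indexed as plain "auto" (without the trailing asterisk). The function names and the gram-size defaults are made up for illustration; in Solr this would be done by the analysis chain, not by application code.

```python
from typing import List, Set


def edge_ngrams(term: str, min_gram: int = 1, max_gram: int = 25) -> List[str]:
    """Return the prefixes of `term` whose length is in [min_gram, max_gram]."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]


def matching_indexed_terms(indexed_terms: Set[str], query_term: str,
                           min_gram: int = 3) -> Set[str]:
    """Return the indexed terms that equal some prefix of the query term."""
    prefixes = set(edge_ngrams(query_term.lower(), min_gram))
    return indexed_terms & prefixes


if __name__ == "__main__":
    indexed = {"auto", "plan"}
    print(edge_ngrams("autocar", 3))
    print(matching_indexed_terms(indexed, "autocar"))
```

Setting a minimum gram size above 1 avoids matching on very short prefixes such as "a".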

Searching for words that are contained in other words

Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries search for documents that have words that contain a word token in the search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks

We don't currently have a mode where all input terms are treated as prefixes. You have a few options depending on what exactly you are looking for:
Set the target searchable field to a language specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language we'll do stemming which helps with things such as "run" versus "running". It won't help with your specific sample of "entrepreneurial" but generally speaking this helps significantly with recall.
Split the search input before sending it to search and add "*" to all terms. Depending on your target language this is relatively easy (i.e. if there are spaces) or very hard. Note that prefixes don't mix well with stemming unless you take them into account and search for both (e.g. something like search=aa bb -> (aa | aa*) (bb | bb*))
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.
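
The "split and add a prefix operator" option could look roughly like this in Python. It assumes the language separates words with whitespace, does no escaping of special characters, and the function name is made up for illustration.

```python
def expand_with_prefixes(search_text: str) -> str:
    """Turn 'aa bb' into '(aa | aa*) (bb | bb*)'.

    Each term is searched both as-is (so stemming still applies)
    and as a prefix. Splitting is on whitespace only.
    """
    terms = search_text.split()
    return " ".join(f"({term} | {term}*)" for term in terms)


if __name__ == "__main__":
    print(expand_with_prefixes("aa bb"))
```

The resulting string would then be sent as the value of the search parameter.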

perhaps this page might be of interest..?
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.

SQL Server CONTAINS and highlighting the matches

Contains() with FORMSOF() is great for trying to capture the user's intent while searching, but is there any way to highlight the matches?
If I search for "said", it might return texts containing "says, say, spoke" etc. Is there a way I can highlight the match in the results, or is there a way to surround the match with underscores? So I might get
She _says_ yes.
I _say_ my name.
We _spoke_ for hours but he didn't _say_ much.
I've considered an after-the-fact (client-side) regex solution that would essentially remove common word endings like (e|ed|es|s|ing) and then look through my results with all those options (so bakes would become bak, and then I'd search for bak[a-z]?(s|d|es|ed|ing)). That works okay for words like that, but there are a whole lot of cases where the past tense doesn't follow that formula, like speak vs. spoke and spake.
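
For reference, the regex idea described above could be sketched in Python roughly like this. The suffix list and helper names are assumptions chosen for illustration, and the second call shows the irregular-verb limitation.

```python
import re

# Common endings to strip, longest first so "es" wins over "s".
SUFFIXES = ("ing", "ed", "es", "e", "s")


def crude_stem(word: str) -> str:
    """Strip one common ending, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def highlight(text: str, word: str) -> str:
    """Surround regular inflections of `word` with underscores."""
    pattern = re.compile(
        r"\b" + re.escape(crude_stem(word)) + r"[a-z]?(?:ing|ed|es|e|s|d)?\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: "_" + match.group(0) + "_", text)


if __name__ == "__main__":
    print(highlight("He bakes and baked while baking.", "bakes"))
    # Irregular forms are not caught: "spoke" stays unmarked for "speak".
    print(highlight("We spoke for hours.", "speak"))
```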

There are two SQL Server functions that can help you with this:
The SOUNDEX function helps you compare similar-sounding words.
The DIFFERENCE function helps you evaluate how different two words are, by comparing their SOUNDEX values.

Sunspot/Solr: word concatenation

I'm using Solr with the Sunspot Ruby gem. It works great, but I'm noticing that sometimes users will get poor search results because they have concatenated their search terms (e.g. 'foolproof') where the document text was 'fool proof'. Or vice-versa.
I was going to try and address this by creating a set of alternate match fields by manually concatenating the words from the source documents together. This seems kind of hackish, and implementing the other side (breaking up user concatenations into words) is not obvious.
Is there a way to do this properly in Solr/Sunspot?

Did you have a look at the Solr spellcheck (or spell check) component?
http://wiki.apache.org/solr/SpellCheckComponent
For example, there is a WordBreakSolrSpellChecker, which may provide valid suggestions in such case.
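
The idea behind word-break suggestions can be illustrated with a small Python sketch: try to split a term into two known words, and try to join adjacent words into one known term. This is only a toy model of the concept under an assumed in-memory vocabulary, not how the Solr component is implemented, and the function names are made up.

```python
from typing import List, Set, Tuple


def break_word(word: str, vocabulary: Set[str]) -> List[Tuple[str, str]]:
    """Return every two-way split of `word` whose halves are both known terms."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in vocabulary and word[i:] in vocabulary
    ]


def combine_words(words: List[str], vocabulary: Set[str]) -> List[str]:
    """Return concatenations of adjacent words that are known terms."""
    return [a + b for a, b in zip(words, words[1:]) if a + b in vocabulary]


if __name__ == "__main__":
    vocabulary = {"fool", "proof", "foolproof"}
    print(break_word("foolproof", vocabulary))
    print(combine_words(["fool", "proof"], vocabulary))
```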

Terms Prevalence in SolR searches

Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they'd have more weight than the rest of words.

For specific documents you can use QueryElevationComponent, which uses a special XML file in which you place the specific terms for which you want specific doc ids.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath, you control the final query. Or, in the worst case, you can modify it after you receive it on the Solr server side.
Similar: Lucene term boosting with sunspot-rails

When you build the query, you can define which values matter and how much weight each one has in the search.
This can be done in several ways:
Setting the boost
The boost can be set by appending "^" and a boost factor to a term (for example, printer^2).
Using the plus operator
If you put the + operator in front of a term in your query, that term becomes required, so only documents that match that field value are shown in the results.
For a better understanding of Solr, it is best to get familiar with the Lucene query syntax. Refer to this link to get more info.
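
As a sketch of the boosting approach, the query string could be built on the application side before it is sent to Solr. The helper below is hypothetical and assumes you maintain your own table of important terms and their boost factors; it splits on whitespace and does not escape special query characters.

```python
from typing import Dict


def boost_terms(question: str, important: Dict[str, int]) -> str:
    """Append ^boost to every word found in the `important` table."""
    parts = []
    for word in question.lower().split():
        boost = important.get(word)
        parts.append(f"{word}^{boost}" if boost else word)
    return " ".join(parts)


if __name__ == "__main__":
    weights = {"printer": 3, "paper": 3}  # assumed, application-specific weights
    print(boost_terms("This morning my printer ran out of paper", weights))
```

Users never see the rewritten query, so they can keep typing plain questions.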
