I have a database of over a million contacts and need to return the best matches for a) user queries and b) batch jobs that run periodically. There is not much debate that people-name matching is complex, and I am considering different routes:
Roll our own (give us something basic to get us out of the blocks). Lots of good threads on this topic, such as How to calculate score for Metaphone/Soundex name searching in .net
Leverage Azure Search / Cognitive Skills: Our platform is already built in Azure, and using Azure Search would potentially be less work than (1) and a smaller jump than (3)
Look to 3rd parties outside of Azure that specialise in the space of people name matching (NetOwl / Basistech / etc.).
Given we are scoped to solving name matching for western-style people names, can someone give me the pros and cons of using Azure Search to solve this? Here are some of the classes of issues I hope we can address:
Phonetic similarity: Jesus <=> Heyzeus
Transliteration spelling differences: Abdul Rasheed <=> Abd al-Rashid
Alternate names: William <=> Will <=> Bill <=> Billy
Missing spaces or hyphens: MaryEllen <=> Mary Ellen <=> Mary-Ellen
Truncated name components: McDonalds <=> McDonald <=> McD
Optional name tokens: Joaquín Archivaldo Guzmán Loera <=> Joaquín Guzmán
Name order variations: Park Sol Mi <=> Sol Mi Park
Initials: J. E. Smith <=> James Earl Smith
Thanks in advance for any guidance and help.
Simon.
Interesting case! I believe there is no right or wrong answer here, and it will also depend on budget and time constraints. What is your primary data source? Are you using a supported source for the Azure Cognitive Search indexer, like SQL or Cosmos DB? How are the contacts stored? First and last name separated, or is everything in just one field?
Since you are mostly looking for guidance around Azure Cognitive Search, I will describe how I would try to tackle this case with Azure Cognitive Search. Hopefully it will help you in deciding which technology suits your purpose the best.
I don't have experience with all of these cases, so please comment on this post if you have better suggestions and I will update it. There are a few similar topics that use different technology, but with the same Lucene query syntax and some of the same tokenizers.
Phonetic similarity: Jesus <=> Heyzeus
You could add the PhoneticTokenFilter, where you can select the encoder with the best performance for your specific case.
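Azure Cognitive Search analyzers are built on Lucene, so as a rough sketch of what this does under the hood, here is an equivalent Lucene analyzer using PhoneticFilter (the DoubleMetaphone encoder and the whitespace tokenizer are assumptions you would tune):

import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.phonetic.PhoneticFilter;

// Index names by their phonetic encoding so similar-sounding spellings can collide
Analyzer phoneticAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer source = new WhitespaceTokenizer();
        // inject=true emits the phonetic code alongside the original token
        TokenStream filtered = new PhoneticFilter(source, new DoubleMetaphone(), true);
        return new TokenStreamComponents(source, filtered);
    }
};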
Transliteration spelling differences: Abdul Rasheed <=> Abd al-Rashid
Fuzzy search could be an option; however, the example above is just too different.
Alternate names: William <=> Will <=> Bill <=> Billy
You could use SynonymMaps if you have this data.
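Azure Cognitive Search synonym maps use the Solr synonym format; an equivalence rule covering the example would look like this (the groupings themselves are data you would have to source, e.g. from a nickname list):

William, Will, Bill, Billy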
Missing spaces or hyphens: MaryEllen <=> Mary Ellen <=> Mary-Ellen
You could possibly use a tokenizer that will remove whitespace and punctuation/symbols.
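As a hedged sketch in Lucene terms (Azure custom analyzers expose the same building blocks): treat the whole field as one token, lowercase it, and strip spaces and hyphens, so all three variants normalize to the same term.

import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;

// Collapse "Mary Ellen" / "Mary-Ellen" / "MaryEllen" into the single token "maryellen"
Analyzer collapsingAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        KeywordTokenizer source = new KeywordTokenizer();   // whole field as one token
        TokenStream ts = new LowerCaseFilter(source);
        ts = new PatternReplaceFilter(ts, Pattern.compile("[\\s\\-]+"), "", true);
        return new TokenStreamComponents(source, ts);
    }
};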
Truncated name components: McDonalds <=> McDonald <=> McD
You could use SynonymMaps if you have this data. However, I think fuzzy search could do the job already.
Optional name tokens: Joaquín Archivaldo Guzmán Loera <=> Joaquín Guzmán
You could leverage Proximity search.
Name order variations: Park Sol Mi <=> Sol Mi Park
This also depends on how the fields are stored, but I think proximity search could solve this case too; see the query-syntax sketch below.
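For both of these proximity cases, the full Lucene query syntax that Azure Cognitive Search supports (queryType=full) would look something like the following; the slop value of 3 is a guess you would need to tune:

search="Joaquín Guzmán"~3&queryType=full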
Initials: J. E. Smith <=> James Earl Smith
You could possibly use a tokenizer in combination with fuzzy search, but I am not sure about this case.
A nice addition is that you can also offer suggestions and/or autocomplete to show the user possible results during typing.
My answers won't solve all cases directly, but they will give you a start. You will have to test and tweak a lot, so keep the time and budget constraints in mind.
Related
All:
I wonder if there is any way we can use Lucene to do search-keyword relevancy discovery based on search history?
For example:
The code would read in the user's search string, parse it, extract the keywords, and find out which words are most likely to occur together in searches.
When I tried Solr, I found that Lucene has a lot of text-analysis features, which is why I am wondering whether we can use it, combined with other machine-learning libs (if necessary), to achieve my goal.
Thanks
Yes and No.
Yes.
It should work. Simply treat every keyword as a document and then use the MoreLikeThis feature of Lucene, which constructs a Lucene query on the fly based on terms within the raw query. The Lucene query is then used to find other similar documents (keywords) in the index.
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

MoreLikeThis mlt = new MoreLikeThis(reader);        // Pass the index reader
mlt.setFieldNames(new String[] {"keywords"});       // Specify the field for similarity
Query query = mlt.like(docID);                      // Pass the doc id
TopDocs similarDocs = searcher.search(query, 20);   // Use the searcher
if (similarDocs.totalHits == 0) {
    // Do handling
}
Suppose in your indexed keywords, you have such keywords as
iphone 6
apple iphone
iphone on sale
apple and fruit
apple and pear
When you launch a query with "iphone", I am sure you will find the first three keywords above as "most similar" due to the full term match on "iphone".
No.
The default similarity function in Lucene never understands that iphone is related to Apple Inc, and thus that iphone is relevant to "apple store". If your raw query is just "apple store", an ideal search result within your current keywords would be as follows (ordered by relevancy from high to low):
apple iphone
iphone 6
iphone on sale
Unfortunately, you will get the results below:
apple iphone
apple and fruit
apple and pear
The first one is great, but the other two are totally unrelated. To get real relevancy discovery (using semantics), you need to do more work on topic modeling. If you happen to have a good way (e.g., a pre-trained LDA model or word2vec) to pre-process each keyword and produce a list of topic ids, you can store those topic ids in a separate field with each keyword document. Something like below:
[apple iphone] -> topic_iphone:1.0, topic_apple_inc:0.8
[apple and fruit] -> topic_apple_fruit:1.0
[apple and pear] -> topic_apple_fruit:0.99, topic_pear_fruit:0.98
where each keyword is also mapped to a few topic ids with weight values.
At query time, you should run the same topic modeling tool to generate topic ids for the raw query together with its terms. For example,
[apple store] -> topic_apple_inc:0.75, topic_shopping_store:0.6
Now you should combine the two fields (keyword and topic) to compute the overall similarity.
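A minimal sketch of that combination in Lucene, where "keywords" and "topics" are the hypothetical field names from above and the boosts come from the topic weights:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Combine raw-term matches with topic-id matches for the query "apple store"
BooleanQuery.Builder combined = new BooleanQuery.Builder();
combined.add(new TermQuery(new Term("keywords", "apple")), BooleanClause.Occur.SHOULD);
combined.add(new TermQuery(new Term("keywords", "store")), BooleanClause.Occur.SHOULD);
combined.add(new BoostQuery(new TermQuery(new Term("topics", "topic_apple_inc")), 0.75f),
        BooleanClause.Occur.SHOULD);
combined.add(new BoostQuery(new TermQuery(new Term("topics", "topic_shopping_store")), 0.6f),
        BooleanClause.Occur.SHOULD);
TopDocs hits = searcher.search(combined.build(), 20);   // reuse the searcher from earlier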
I am trying to compare two documents in Solr (say Doc A and Doc B) based on a common "name" field, using a Solr query. Based on a query for A.name, I get result document B with a relevancy score of, say, SCR1. Now if I do it the reverse way, i.e. query with B.name, I get document A somewhere in the results, but this time the score of A against B is not the same SCR1.
I believe this is happening because the numbers of terms in Doc A.name and Doc B.name are different, so the similarity scores are not the same. Is that the reason for this difference?
Is there any way I can get the same score either way (as described above)?
Is it not possible to compare the scores of any two queries?
Is it possible to do this in native Lucene APIs?
To answer your second question: scores from two different queries must not be compared.
A similar question was posted in the java-users lucene mailing list.
Here's a link to it: Compare scores across queries
An explanation is given there as to why one must not do that.
I'm not quite sure I'm clear on the queries you are referring to, but let's say the situation is something like this:
Doc A: Name = "Carlos Fernando Luís Maria Víctor Miguel Rafael Gabriel Gonzaga Xavier Francisco de Assis José Simão de Bragança, Sabóia Bourbon e Saxe-Coburgo-Gotha"
Doc B: Name = "Tomás António Gonzaga"
If you search for "gonzaga", Doc B will be given the higher score, since, while there is one match in each name, Doc B has a much shorter name, with only three terms, and shorter fields are weighed more heavily. This is the lengthNorm referred to in the TFIDFSimilarity documentation.
There are other factors though. If we just chuck each name into the queryparser, and see what comes up, something like:
Query queryA = queryparser.parse(docA.name);
Query queryB = queryparser.parse(docB.name);
Then the queries generated are much different:
name:carlos name:fernando name:luis name:maria name:victor name:miguel name:rafael name:gabriel name:gonzaga name:xavier name:francisco name:de name:assis name:jose name:simao name:de name:braganca name:saboia name:bourbon name:e name:saxe name:coburgo name:gotha
vs
name:tomas name:antonio name:gonzaga
There are a wealth of reasons why these would generate different scores: the lengthNorm discussed above; the coord factor, which boosts results that match more query terms, would very likely come into play; tf, which weighs documents with more matches for a term more heavily; idf, which prefers terms that appear less frequently over the entire index; and so on.
Scores are only relevant to the result set of a query run. A change to the query, or to the state of the index, can lead to different scores, and they are not intended to be comparable. You can use IndexSearcher.explain to understand how a score was calculated.
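For example, assuming the searcher, query, and a hit's docID from a search like the ones above:

import org.apache.lucene.search.Explanation;

// Prints the full score breakdown (tf, idf, lengthNorm, coord, ...) for one hit
Explanation explanation = searcher.explain(query, docID);
System.out.println(explanation.toString());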
I have the following scenario. Suppose I have a big table like this:
Id (unique)   returnMe   desc                       name      value
1             user1      all those living in usa    country   USA
2             user2      all those like game        game      football
3             user1      my hobbies are             hobby     guitar
Now, how can I get results (returnMe) for the following queries:
1. For all those users who live in usa AND like guitar
2. For all those users who live in usa OR like guitar.
Please do not modify the query in any way.
In my solrconfig.xml, 'desc', 'name', and 'value' are searchable, indexable fields.
Thanks for any help.
Well, I am editing this to explain my logic:
Step 1: Break the query on AND, like: (live in USA) AND (like guitar)
Step 2: Then select the returnMes from the first query and the returnMes from the second query.
Step 3: Take the returnMes common to the first and second queries.
Is there any way Solr can do that? Can we do it through a Solr "join", or some other way?
I don't want to do that in my PHP; it would be a massive overhead.
You are going to need to modify the query in some way. A simple step would be to parse the query and add parentheses to it, and possibly field names to search. You could reasonably easily transform those queries into something like:
(For all those users who live in usa) AND (like guitar)
(For all those users who live in usa) OR (like guitar)
or perhaps you can cut out "for all those users who" and have simply:
(live in usa) AND (like guitar)
(live in usa) OR (like guitar)
And set the query field to value; a small transform sketch follows below. Of course, you could run into issues if you had a document with value=users, or something of that nature, since it will search for each term present in the value field.
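A rough sketch of such a transform (the preamble regex and the value field are assumptions based on your examples):

// Strip the shared preamble, then parenthesize the clauses around the boolean operator
String raw = "For all those users who live in usa AND like guitar";
String stripped = raw.replaceFirst("(?i)^for all those users who\\s+", "");
String op = stripped.contains(" AND ") ? "AND" : "OR";
String[] parts = stripped.split("\\s+(AND|OR)\\s+");
String solrQuery = "value:(" + parts[0] + ") " + op + " value:(" + parts[1] + ")";
// -> value:(live in usa) AND value:(like guitar)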
If you really want to be able to work with natural language, then you can take a look at the OpenNLP project.
I have a series of docs containing a nickname (possibly with spaces) and an ID.
The nickname can be like ["example","nick n4me", "nosp4ces","A fancy guy"].
I have to find a query that allows me to find profiles by perfect matching, fuzzy matching, or even partial characters.
So if I write down "nick" or "nick name", the document "nick n4me" should always come out.
I tried with something like:
nickname:(%1%^4 %1%~^3 %1%*^1)
where "%1%" is what I'm searching, but it doesn't work, especially for spaces or numbers nicknames. For example if i try to search "nick n" the query would be:
nickname:(nick n^4 nick n~^3 nick n*^1)
Boosting with ^ will only affect the scoring and not the matching, i.e. if your query does not match at all, boosting the terms or not won't make any difference.
In your specific example, the query won't match because:
1) nick n won't match, because that would require that either the token nick or the token n had been produced at indexing time;
2) EDIT: I found out that fuzzy queries work only on single terms if you use the standard query parser. In your case, you should probably rewrite nick n~ using ComplexPhraseQueryParser, so you can do a fuzzy query on the whole PhraseQuery. Also, you can specify a threshold for your fuzzy query (technically, you are specifying a maximum Levenshtein distance). Obviously you have to adjust the threshold, and that usually requires some trial and error; a sketch follows below.
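A hedged sketch of that rewrite (the nickname field is from your query; the ~2 edit distances are guesses to tune):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.Query;

// Fuzzy terms inside a phrase, which the standard query parser cannot do
ComplexPhraseQueryParser parser =
        new ComplexPhraseQueryParser("nickname", new StandardAnalyzer());
Query q = parser.parse("\"nick~2 n4me~2\"");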
An easier tactic is to load all nicknames into one field -- in your example you would have 4 values for your nickname field. If you want embedded spaces in your nicknames, you will need to use a simpler analyzer than StandardAnalyzer or use phrase searches.
I am trying to search a SQL Server 2008 table (containing about 7 million records) for cities and countries based on user-input text. The search string I get from the user can be anything like:
"Hotels in San Francisco, US" or "New York, NY" or "Paris sddgdfgxx" or "Toronto Canada". Terms are not always separated by a comma, are not in a specific order, and there might be useless data mixed in.
This is what I tried:
Method 1: FTS with contains:
ex: select * from cityNames where contains(cityname,'word1 and word2') -- with AND
select * from cityNames where contains(cityname,'word1 or word2') -- with OR
This didn't work very well, because a term like 'sddgdfgxx' would return nothing if used with 'AND'. Using 'OR' will work for one-word cities like 'Paris', but not for 'San Diego' or 'San Francisco'.
Method 2: this is actually a reverse search; the logic is to check whether the user input string contains any of the cities or countries from my table. This way I'll know for sure that 'Aix en Provence' or 'New York' was searched for.
ex: select * from cityCountryNames where 'Ontario, Canada, Toronto' like '%' + cityCountryNames + '%'
notes: I wasn't able to get results for two-word cities, and the query was slow.
Any help is appreciated.
I would strongly recommend using a 3rd-party API like the Google Geocoding API to take such input and parse it into a location with discrete parts (street address, city, state, country, etc.) Then you could use those discrete parts to search your database if necessary.
Map services like Google and Bing have solved this problem way better than you or I ever would, so why not leverage all the work they've done?
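If it helps, here is a rough sketch of such a call (YOUR_KEY is a placeholder; the JSON response carries address_components such as locality and country that you can then match against your table):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Let the Geocoding API parse the messy user string into structured parts
String input = URLEncoder.encode("Hotels in San Francisco, US", StandardCharsets.UTF_8);
HttpRequest request = HttpRequest.newBuilder(URI.create(
        "https://maps.googleapis.com/maps/api/geocode/json?address=" + input + "&key=YOUR_KEY"))
        .build();
String json = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
        .body();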
SQL isn't designed for the kinds of queries you are performing, certainly not at scale.
My recommendation would be as follows:
Index all your places (cities + countries) into a Solr index. Solr is a FOSS search server built using Lucene that can easily query a 7MM-record index in milliseconds or less.
Query Solr with the user-typed string, and voilà, the first match is the best match.
So even if the user typed "Paris sddgdfgxx", Paris should be your first hit. If you want to get really sophisticated, use an n-gram approach (known in Lucene as shingles).
Since Solr offers a RESTful (HTTP) API, it should easily integrate into whatever platform you are on.
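If it helps, here is a minimal SolrJ sketch (the core name places is made up, and a plain HTTP GET against /select works just as well):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Fire the raw user string at the index; the top-scoring hit is the best match
HttpSolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/places").build();
QueryResponse rsp = solr.query(new SolrQuery("Paris sddgdfgxx"));
System.out.println(rsp.getResults().get(0));   // e.g. the Paris record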