Handling Full Text Permutations SQL Server - sql-server

We have a database table that holds a list of countries and basic information about them, and we use Full Text Indexing to return that data to users through a web app. I am looking into country names and the different permutations (spellings) that are possible. Take the country below as an example.
St. Martin
Saint Martin
St. Marteen
Sint Maarten
As you can see, depending on the region, the user can enter any one of those and expect to get the same result. I imagine that to make this work, a lookup table of sorts is needed to run CONTAINS() against. I just wanted to know if there is a blog post or a "Best Practices" way to go about doing this.
Let me know if you have any questions.

Try customizing the thesaurus file. See Configure and Manage Thesaurus Files for Full-Text Search.
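As a rough illustration of what such a thesaurus entry might look like, the sketch below renders one expansion set for the spellings from the question. Python is used only to produce the XML text, and the helper name expansion_set is hypothetical. The element names (expansion, sub) follow the thesaurus file format as I understand it from the SQL Server documentation, which should be checked for the full file layout and for how to reload the file after editing.

```python
from xml.sax.saxutils import escape

def expansion_set(variants):
    """Render one <expansion> block; each <sub> term is treated as a synonym of the others."""
    lines = ["<expansion>"]
    lines += [f"    <sub>{escape(v)}</sub>" for v in variants]
    lines.append("</expansion>")
    return "\n".join(lines)

# The four spellings listed in the question.
print(expansion_set(["St. Martin", "Saint Martin", "St. Marteen", "Sint Maarten"]))
```

A query would then go through the thesaurus with something like CONTAINS(CountryName, 'FORMSOF(THESAURUS, "Saint Martin")'), where CountryName is an assumed column name.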

Related

information extraction about a person from a document

I want to extract information about a particular person from a document that may contain information about many people. Statements like "he works for XYZ COMPANY" should also be attributed to that person, and nicknames should be handled as well.
I have tried using NLTK and spaCy and have managed to extract entities from the document. I am not sure how to proceed.
Try using a more complete NER library; maybe Stanford CoreNLP can help you: https://stanfordnlp.github.io/CoreNLP/

Making solr to understand English

I'm trying to set up Solr so that it understands English. For example, I've indexed our company website (www.biginfolabs.com), but it could be any other website or our own data.
If I enter English-like queries, I should get a one-word answer, just as Google does. Example queries:
Where is India located.
who is the father of Obama.
Workaround:
Integrated UIMA and Mahout with Solr (person name and city name extraction is done).
I read the book "Taming Text" and implemented https://github.com/tamingtext/book, but did not get what I want.
Can anyone please tell me how to move further? It can be anything; our team is ready to do it.
This task is called Named Entity Recognition. You can look up this tutorial to see how they use Solr for extracting atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc., and then learning a model to answer queries.
Also have a look at Stanford NLP for more ideas on algorithms that you can use.

Sql Server Full Text: Human names which sound alike

I have a database with lots of customers in it. A user of the system wants to be able to look up a customer's account by name, amongst other things.
What I have done is create a new table called CustomerFullText, which just has a CustomerId and an nvarchar(max) field "CustomerFullText". In "CustomerFullText" I keep concatenated together all the text I have for the customer, e.g. First Name, Last Name, Address, etc, and I have a full-text index on that field, so that the user can just type into a single search box and gets matching results.
I found this gave better results than trying to search data stored in lots of different columns, although I suppose I'd be interested in hearing if this in itself is a terrible idea.
Many people have names which sound the same but which have different spellings: Katherine and Catherine and Catharine, and perhaps someone whose record in the database is Katherine but who introduces themselves as Kate. Also, McDonald vs MacDonald, Liz vs Elisabeth, and so on.
Therefore, what I'm doing is, whilst storing the original name correctly, making a series of replacements before I build the full text. So ALL of Katherine and Catherine and so on are replaced with "KATE" in the full text field. I do the same transform on my search parameter before I query the database, so someone who types "Catherine" into the search box will actually run a query for "KATE" against the full text index in the database, which will match Catherine AND Katherine and so on.
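The replacement step described above could be sketched roughly as follows. This is an illustration in Python rather than the actual implementation; the CANONICAL table and the normalize helper are hypothetical names, and the table holds only the examples mentioned here.

```python
import re

# Hypothetical mapping from spelling variants to one canonical token.
CANONICAL = {
    "katherine": "KATE", "catherine": "KATE", "catharine": "KATE", "kate": "KATE",
    "mcdonald": "MACDONALD", "macdonald": "MACDONALD",
    "liz": "ELISABETH", "elisabeth": "ELISABETH",
}

def normalize(text):
    """Replace each known variant with its canonical token; other words pass through lower-cased."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(CANONICAL.get(t, t) for t in tokens)

# The same transform is applied when building the full-text field and when querying it.
indexed = normalize("Katherine McDonald, 12 High Street")
query = normalize("Catherine MacDonald")
```

Because both sides go through the same function, the stored text and the search term end up with identical tokens for any pair of variants in the table.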
My question is: does this duplicate any part of existing SQL Server Full Text functionality? I've had a look, but I don't think that this is the same as a custom stemmer or word breaker or similar.
Rather than trying to phonetically normalize your data yourself, I would use the Double Metaphone algorithm, essentially a much better implementation of the basic SOUNDEX idea.
You can find an example implementation here: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574, and more are listed in the Wikipedia link above.
It will generate two normalized code versions of your word. You can then persist those in two additional columns and compare them against your search text, which you would convert to Double Metaphone on the fly.
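To make the persist-and-compare idea concrete, here is a minimal sketch in Python. Double Metaphone is too long to reproduce here, so the sketch swaps in classic Soundex, which yields a single code per word rather than two. The function names are hypothetical, and with a real Double Metaphone encoder the comparison would check both stored codes.

```python
# Letter groups of classic Soundex; vowels and y carry no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODE = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}

def soundex(word):
    """Return the 4-character Soundex code of a word, or '' if it has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODE.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":  # h and w do not break a run of equal codes
            continue
        digit = _CODE.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated consonant is coded again
    return (code + "000")[:4]

def sounds_alike(stored_code, search_text):
    """Compare a persisted code with the code of the search text, computed on the fly."""
    return stored_code == soundex(search_text)

stored = soundex("McDonald")  # would be persisted in an extra column
print(sounds_alike(stored, "MacDonald"))
```

Soundex keeps the first letter verbatim, so Katherine and Catherine get different codes under this stand-in; that is a limitation of plain Soundex.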

How to implement an Enterprise Search

We are searching disparate data sources in our company. We have information in multiple databases that need to be searched from our Intranet. Initial experiments with Full Text Search (FTS) proved disappointing. We've implemented a custom search engine that works very well for our purposes. However, we want to make sure we are doing "the right thing" and aren't missing any great tools that would make our job easier.
What we need:
- Column search
  - ability to search by column
  - we flag which columns in a table are searchable
- Keep some relation between db column and data
  - we provide advanced filtering on the results
  - facilitates (Amazon-style) filtering
    - filter provided by grouping of results and allowing the user to filter them via a checkbox
    - this is a great feature, users like it very much
- Partial Word Match
  - we have a lot of unique identifiers (product id, etc.)
    - the unique ids can have sub-parts with meaning (location, etc.)
    - or only a portion may be available (when the user is searching)
    - or (by a decidedly poor design decision) there may be white space in the id
  - this is a major feature that we've implemented now via CHARINDEX (MSSQL) and INSTR (Oracle)
    - using the char index functions turned out to be equivalent performance (+/-) on MSSQL compared to full text
    - didn't test on Oracle
    - however, searches against both types of db are very fast
  - We take advantage of Indexed (MSSQL) and Materialized (Oracle) views to increase speed
    - this is a huge win; Oracle Materialized views are better than MSSQL Indexed views
    - both provide speedups in read-only join situations (like a search combining company and product)
- A search that matches user expectations of the paradigm CTRL-F -> enter text -> find matches
  - Full Text Search is not the best in this area (slow and inconsistent matching)
  - partial matching (see "Partial Word Match")
Nice to have:
- Search database in real time
  - skip the indexing step; this is not a hard requirement
- Spelling suggestion
  - Xapian has this http://xapian.org/docs/spelling.html
  - Similar to Google's "Did you mean:"
What we don't need:
- We don't need to index documents
  - at this point searching our data sources is the most important thing
  - even when we do search documents, we will be looking for partial word matching, etc.
- Ranking
  - Our own simple ranking algorithm has proven much better than an FTS equivalent.
  - Users understand it, we understand it, it's almost always relevant.
- Stemming
  - Just don't need to get [run|ran|running]
- Advanced search operators
  - phrase matching, or/and, etc.
  - according to Jakob Nielsen http://www.useit.com/alertbox/20010513.html
    - most users are using simple search phrases
    - very few use advanced searches (when they're available)
  - also in Information Architecture 3rd edition Page 185
    - "few users take advantage of them [advanced search functions]"
    - http://oreilly.com/catalog/9780596000356
  - our Amazon-like filtering allows better filtering anyway (via user testing)
- Full Text Search
  - We've found that results don't always "make sense" to the user
  - Searching with FTS is hard to tune (which set of operators matches the users' expectations?)
  - Advanced search operators are a no go
    - we don't need them because
    - users don't understand them
  - Performance has been very close (+/-) to the char index functions
    - but the results are sometimes just "weird"
The question:
Is there a solution that allows us to keep the key value pair "filtering feature", offers the column specific matching, partial word matching and the rest of the features, without the pain of full text search?
I'm open to any suggestion. I've wondered if a document/hash table nosql data store (MongoDB, et al) might be of use? ( http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo ). Any experience with these is appreciated.
Again, just making sure we aren't missing something with our in-house customized version. If there is something "off the shelf" I would be interested in it. Or if you've built something from some components, what components (search engines, data stores, etc) did you use and why?
You can also make your point for FTS. Just make sure it meets the requirements above before you say "just use Full Text Search because that's the only tool we have."
I ended up coding my own.
The results are fantastic. Users like it, it works well with our existing technologies.
It really wasn't that hard. Just took some time.
Features:
Faceted search (amazon, walmart, etc)
Partial word search (the real stuff not full text)
Search databases (oracle, sql server, etc) and non database sources
Integrates well with our existing environment
Maintains relations, so I can have an n-to-n search and display
--> this means I can display child records of a master record in search results
--> also I can search any child field and return the master record
It's really amazing what you can do with dictionaries and a lot of memory.
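The answer does not show its code, so the following is only a guess at the general shape of such an in-memory search: a Python sketch that combines substring (partial word) matching, whitespace-insensitive ids, and facet counts. The records, the column names, and the function names are all made up for illustration.

```python
from collections import Counter

# Hypothetical records; in practice these would be loaded from the databases.
RECORDS = [
    {"id": "NY 1001-A", "type": "product", "site": "New York"},
    {"id": "NY1002-B", "type": "product", "site": "New York"},
    {"id": "TX2001-A", "type": "order", "site": "Texas"},
]
SEARCHABLE = ("id", "site")  # columns flagged as searchable
FACETS = ("type", "site")    # columns offered as checkbox filters

def squash(text):
    """Lower-case and drop whitespace so 'NY 1001' and 'ny1001' compare equal."""
    return "".join(text.lower().split())

def search(term, records=RECORDS, filters=None):
    """Return (hits, facets): substring matches plus per-column value counts."""
    needle = squash(term)
    filters = filters or {}
    hits = [
        r for r in records
        if any(needle in squash(r[col]) for col in SEARCHABLE)
        and all(r[key] == value for key, value in filters.items())
    ]
    facets = {col: Counter(r[col] for r in hits) for col in FACETS}
    return hits, facets

hits, facets = search("ny100")
```

Calling search("-a", filters={"type": "order"}) narrows the same partial match to one facet value, which is roughly what ticking a checkbox would do.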
I recommend looking into Solr; I believe it will meet your needs:
http://lucene.apache.org/solr/
For an off-the-shelf solution: Have you checked out the Google Search Appliance?
Quote from the Google Mini/GSA site:
... If direct database indexing is a requirement for you, we encourage you to consider the Google Search Appliance, which has direct database connectivity.
And of course it indexes everything else in the Googly manner you'd expect it to.
Apache Solr is a good way to start your project, and it is open source. You can also try Elasticsearch, and there are a lot of off-the-shelf products that offer good customization and search features, such as Coveo, SharePoint FAST, Google ...

Tips on how to improve full text search for search engine

I'm developing: http://www.buscatiendas.com.mx
I've seen people entering text for queries with lots of typos.
What kind of search could I implement so that similar words are found?
Something more or less like what Google does would be neat.
I'm using SQL Server Full Text search.
Why don't you have Google/Bing index it for you and just use that, via the site: feature they provide?
If that is not an option, you might need a 'spell checker' of your own (either implement one yourself or use an existing one) that is trained on the data you have. Note that spell checking is not deterministic (for example, is "latel" meant to be "label" or "later"?). You can only make a 'best' guess based on the actual data on your site.
There are probabilistic models where you can 'train' your spell guesser/checker to come up with a 'best' guess.
The following page seems pretty useful. It has a description on how to write one yourself, and also has good links (including a survey paper) and links to implementations in different languages:
http://norvig.com/spell-correct.html.
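A condensed sketch in the spirit of that page, written in Python: it counts words in the site's own text, generates every string one edit away from the query word, and picks the known word seen most often. The corpus string and the a-z alphabet are assumptions; a Spanish-language site would need accented letters and ñ added.

```python
import re
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: extend for other languages

def words(text):
    """Split text into lower-case a-z words."""
    return re.findall(r"[a-z]+", text.lower())

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Hypothetical training text; in practice, the text already on the site.
counts = Counter(words("later label later"))
print(correct("latel", counts))
```

With this toy corpus both "label" and "later" are one edit from "latel", and the guess falls to whichever appears more often in the data, which is the 'best guess' behaviour described above.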
There are two ways to solve this:
Buy a 3rd party product, like a Google Search Appliance or one of Microsoft's search servers.
Log all queries and have someone review them, building a table that links the bad queries to what they should be. (It's possible you could buy a component library that does this, much like a spelling checker.)
If you want to roll your own, you first need to filter out noise words before you even start searching, because they may just impose unnecessary load on your database. Should "a good book" be the same as searching for "the good book" or "his good book" or "good and bad reviews on a book"? Obviously, "a", "the", "an", "and", etc. do not qualify as "useful" search keywords at all. Once you have the "noise" filtered out, you start the real searching. Again, you should consider database performance: is it wise to search a dynamic database or a pre-processed one? Figure out a way to filter the noise words out of the search data too.
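A minimal sketch of that filtering step, in Python. The noise word list is an assumption that extends the words named above with a few others for the example, and keywords is a hypothetical helper name.

```python
# Assumed noise word list; a real one would be tuned to the site's language and data.
NOISE_WORDS = {"a", "an", "and", "the", "his", "on", "of"}

def keywords(query):
    """Lower-case the query and drop noise words, keeping the remaining word order."""
    return [w for w in query.lower().split() if w not in NOISE_WORDS]

print(keywords("a good book"))
print(keywords("good and bad reviews on a book"))
```

Applying the same function to the stored text before indexing keeps the search data and the queries consistent.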
