Tips on how to improve full text search for search engine - sql-server

I'm developing: http://www.buscatiendas.com.mx
I've seen people entering text for queries with lots of typos.
What kind of search could I implement so that similar words are found?
Something like what Google does, more or less, would be neat.
I'm using SQL Server Full Text search.

Why don't you have Google/Bing index the site for you and then just query it with the site: feature they provide?
If that is not an option, you might have to have a 'spell checker' of your own (either implement one yourself or use an existing one), trained on the data you have. Note that spell checking is not deterministic (e.g. 'latel': is it 'label'? 'later'?). You can only make a 'best' guess based on the actual data on your site.
There are probabilistic models where you can 'train' your spell guesser/checker to come up with a 'best' guess.
The following page is pretty useful. It describes how to write one yourself, and also has good links (including a survey paper) and implementations in different languages:
http://norvig.com/spell-correct.html
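
As a very rough sketch of that idea, a small corrector can be built on Python's standard difflib, with the vocabulary taken from your own catalog text; the sample store/product strings below are invented, not from the site:

```python
# Minimal "did you mean" sketch trained on your own data.
from collections import Counter
import difflib
import re

def build_vocabulary(texts):
    """Count word frequencies across your own catalog / query data."""
    words = re.findall(r"\w+", " ".join(texts).lower())
    return Counter(words)

def suggest(word, vocabulary, cutoff=0.8):
    """Return the closest known word, preferring the most frequent one."""
    candidates = difflib.get_close_matches(word.lower(), vocabulary, n=3, cutoff=cutoff)
    if not candidates:
        return word  # nothing similar enough; keep the user's word
    return max(candidates, key=lambda w: vocabulary[w])

vocab = build_vocabulary(["zapatos para dama", "ropa de bebe", "ferreteria el centro"])
print(suggest("zaptos", vocab))  # -> "zapatos"
```

Norvig's page linked above describes a more principled, probabilistic version of the same idea.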

There are two ways to solve this:
1. Buy a third-party product, like a Google Search Appliance or one of Microsoft's search servers.
2. Log all queries and have someone review them, building a table that maps the bad queries to what they should be. (It's possible you could buy a component library that does this, much like a spelling checker; a small sketch of the idea follows below.)
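
A hedged sketch of option 2, with the reviewed corrections kept in a plain mapping (in practice this would be a table fed by the query log; all queries shown here are made up):

```python
# Toy version of the "log queries, review them, map bad -> good" approach.
corrections = {
    "zapateira": "zapateria",                         # added after human review
    "ropa para dama barta": "ropa para dama barata",  # added after human review
}

query_log = []  # everything users typed, kept for the next review pass

def normalize_query(raw):
    query_log.append(raw)
    return corrections.get(raw.strip().lower(), raw)

print(normalize_query("Zapateira"))  # -> "zapateria"
```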

If you want to roll your own, first you need to filter out noise words before you even start searching, because they just put unnecessary load on your database. Should "a good book" be treated the same as searching for "the good book", "his good book", or "good and bad reviews on a book"? Obviously "a", "the", "an", "and", etc. do not qualify as useful search keywords. Once you have the noise filtered out, start the real searching. Again, consider database performance: is it wise to search a live database or a pre-processed one? Figure out a way to filter the noise words out of the searched data too.
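
As a small illustration of the noise-word filtering described above (the stop-word list here is only an example, not a complete one):

```python
# Strip noise words from a query before it ever reaches the database.
NOISE_WORDS = {"a", "an", "the", "and", "or", "his", "her", "on", "de", "la"}

def strip_noise(query):
    return [t for t in query.lower().split() if t not in NOISE_WORDS]

print(strip_noise("a good book"))    # -> ['good', 'book']
print(strip_noise("his good book"))  # -> ['good', 'book'] (same useful terms)
```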

Related

Thought on creating a searchable database

I guess what I need is two things. First, a way to input data into an Excel-like application or a form builder, then a way to search those entries. For example, for CAR PART, you would put car Part A into Field 1, Field 2 would be car Type, followed by make and model. The fields would need to be made into a form consisting of preset inputs such as (Title/Type) and (Variable Categories), so a drop-down menu, icons, or checkboxes would help narrow down the list of results. What pieces need to be in place to build/use a lightweight database/application like this that allows inputting new information and then searching those variables later? Also, is there any application that does this already, a programming language to learn, or an estimated cost and set of requirements to have it built?
First, there might be something off the shelf that does this already. Microsoft Access would be a good place to start to see if it fits your needs -- you can build forms and store data without much programming effort. As time goes on, you can scale up to SQL Server.
It's not clear to me if your data is relational or not, and it might not matter much at first (any database will likely handle your queries to start). I originally thought your data was not relational, but re-reading your post, I'm not so sure now.
If that doesn't work, or you want more flexibility, then I'd start looking at NoSQL as an option. Some good choices include Mongo and RavenDB (there are many others).
You can program it yourself with just about any major language -- some provide more or less functionality based on the tie-in to the data.

short text syntactic classification

I am a newbie at machine learning and data mining. Here's the problem: I currently have one input variable, a short text made up of non-standard nouns, and I want to classify it into a target category. About 40% of the entire dataset is available as training data; the remaining 60% we would like to classify as accurately as possible. Following are some input values, across multiple observations, that are assigned the 'LEAD_GENERATION_REPRESENTATIVE' title.
"Business Development Representative MFG"
"Business Development Director Retail-KK"
"Branch Staff"
"Account Development Rep"
"New Business Rep"
"Hong Kong Cloud"
"Lead Gen, New Business Development"
"Strategic Alliances EMEA"
"ENG-BDE"
I think the above gives an idea of what I mean by non-standard nouns. I can see a few tokens here that are meaningful, like 'development', 'lead', 'rep'. Others seem random, without any semantics, but they may appear multiple times in the data. Another thing is that some tokens like 'rep' and 'account' can appear under multiple categories. I think that will make weighting/similarity a challenging task.
My first question is: "is it worth automating this kind of classification?"
Second: "is it a good problem for learning machine learning classification?" There are only 30k such entries and a handful of target categories. I could find someone to do this manually, which would also be more accurate.
Here's my take on the problem so far:
Full-text engine: something like Solr, to build an index and query rules that draw matches based on tokens -- words, phrases, synonyms, acronyms, descriptions. I can get someone to define a detailed taxonomy for each category. Use boosting and a pluggable scoring library.
Machine learning:
Naive Bayes classification
Decision tree
SVM
I have tried Solr for this with a reverse lookup, since I don't have a taxonomy available at the moment. It seems I can get about 80% true positives (I'll have to dig more into the confusion matrix to reduce false positives). My query is a bunch of boolean terms and phrases with proximity and boosts, plus negations to reduce errors. I'm afraid this approach may overfit and won't scale.
I am aware that people usually try multiple modeling techniques to see which works best, or derive a combination of techniques. I want to understand this problem from a feasibility and complexity point of view. If it's too broad a question, please just comment on the feasibility of a solution.
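
For what it's worth, a minimal baseline for the Naive Bayes option listed above can be sketched with scikit-learn (an assumed toolkit; the second category and its titles are invented for the example, only the lead-generation titles come from the question):

```python
# Tiny bag-of-words Naive Bayes baseline; a feasibility probe, not a full solution.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

titles = [
    "Business Development Representative MFG",   # from the question
    "Account Development Rep",                   # from the question
    "Lead Gen, New Business Development",        # from the question
    "Senior Software Engineer",                  # invented, for a second class
    "Java Developer Backend",                    # invented, for a second class
]
labels = [
    "LEAD_GENERATION_REPRESENTATIVE",
    "LEAD_GENERATION_REPRESENTATIVE",
    "LEAD_GENERATION_REPRESENTATIVE",
    "SOFTWARE_ENGINEER",                         # hypothetical category
    "SOFTWARE_ENGINEER",
]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(titles, labels)
print(model.predict(["New Business Rep"]))       # expect the lead-gen category
```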

What's the best way to store unstructured text file for data mining

I have millions of news articles in plain text on my machine. I want to do some text mining on them.
I want first to store these articles in a more structured way. What's the best way to do that, so it becomes more convenient to do data mining later on?
Currently I just store the files in a database, indexed by the news headline and the file path.
Any suggestion will be really appreciated. Thanks!
That depends greatly on what you want to achieve with the more structured data.
If the data size is not heavy, you could use "in text" search on your database and you are already done.
A category or "tag" like here on stackoverflow would help greatly to categorize and group your content, but I guess it is very hard to extract that from your pure text base now.
Also a simple timestamp (you could get from the file itself, but be wary some systems alter that date when files get copied...) could help too.
For content extraction, have a look at http://www.opencalais.com/ , it offers an api for "text" analysis you might find interesting.
What do you mean by "do some text mining"? Are you just looking to store the text? Or, are you looking for a solution?
Many databases offer the capability to store text and do fast retrievals on them.
However, text mining typically covers a broader range of themes. Here are some examples:
Finding documents with similar themes.
Exposing sentiment in the documents.
Answering questions posed in natural language.
Summarizing documents.
Filling in data structures with information from documents.
Using information from documents for predictive modeling purposes.
Assigning codes to documents.
For such analyses, you would normally use text mining tools (you can look for these on kdnuggets.com, for instance). The tool then affects how the text is stored.
The last chapter of "Data Mining Techniques for Marketing, Sales, and Customer Support" is about text mining and has a very good case study on text mining applied to customer service records.
[In response to comment]
Is this an academic project or "real world"? Is the text monolingual? If so, is it English? You definitely need to do some research. Text analysis/mining has been an area of rather intense study since at least 1950, when Alan Turing proposed the Turing test.
As an example, I can readily think of four very different options for storing text for analysis. The first is "as is", which is most useful if you have lots of processors and memory. The second would be "grammatically", with text tagged with grammar and meanings, which is most effective if you have a team with lots of PhDs. Third is as an inverted index, which is the basic form for searching and some proximity matching. The fourth is by projecting onto an orthogonal space, using singular value decomposition (most useful if you want to use the text as input to other statistical techniques).
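
To make the third option concrete, here is a toy inverted index; the document texts are made up, and real engines add much more (positions, stemming, scoring):

```python
# Toy inverted index: token -> set of document ids containing it.
from collections import defaultdict
import re

docs = {
    1: "Central bank raises interest rates",
    2: "New bank branch opens downtown",
    3: "Downtown traffic delays expected",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in re.findall(r"\w+", text.lower()):
        index[token].add(doc_id)

def search(query):
    """Ids of documents containing every token of the query."""
    sets = [index[t] for t in re.findall(r"\w+", query.lower())]
    return set.intersection(*sets) if sets else set()

print(search("bank downtown"))  # -> {2}
```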

Best way to store large searchable text files

I am developing an online Bible search program. The Bible is a pretty large book, taking up nearly 5MB of space in plain text. I am planning on implementing an API in the program as well, allowing other websites to include their own Bible search widgets and programs without having to develop the search queries or store Bibles on their own servers.
With this in mind, I am going to expect that eventually I will have a moderate flow of queries passing through the program. Also, for those not familiar with the Bible, it has 2 methods of formatting the text. It can contain both red text and italics. I need a way to store the Scriptures along with the red letter and italics formatting but allowing the search queries to ignore the formatting.
It also needs to be fast and as efficient (memory and cpu usage) as possible. Any storage format will be considered (MySQL, JSON or XML text files, etc) as long as the querying can be done ignoring the formatting. File size and count doesn't really matter, so splitting up the books or even chapters into separate files is fine by me.
One more important thing to keep in mind though, is that I want to have some form of search method that can search across multiple verses. So a search for "but have everlasting life for God sent not his Son" would return John 3:16,17. Thanks for all ideas!
There are a bunch of different open source document search engines which are made for precisely what you're trying to do. Solr, Elastic Search, Xapian, Whoosh, Haystack (made for Django) and others. There are other posts on S.O. and elsewhere that go into the benefits of using one vs another, but your requirements are simple enough that any of them will be more than fine (and easily scale with very minimal effort should your project take off, which is always nice to know). So look at their examples and see which one looks most intuitive to you - Solr is arguably the most popular and it's the only one I've worked with, but Elastic Search uses the same popular Lucene backend and is apparently much easier to get up and running, so I would start there.
As for the actual implementation, you'll want to index each verse as a separate "document" if the single verse (or just verse number) is what you want to return. The search engine handles the ranking of the results based on relevancy (usually using a tf/idf algorithm, in case you're interested).
The way I'd handle the italics and red text is to include some kind of markup in the text (i.e. wrap the phrase in single asterisks for italics, double asterisks for red) and then tell the analyzer to ignore those characters - there may be a simpler way in the framework you end up choosing, though, so take that with a grain of salt. The queries spanning multiple verses requirement is more complicated, but the answer will probably involve indexing each whole chapter as a document instead of (or maybe in addition to? I'd have to think about it more) each verse.
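
As a sketch of that markup idea (the asterisk convention is just the one suggested above, and the placement of the markers is purely illustrative): keep the marked-up copy for display and derive a plain copy for whatever index you end up using:

```python
# Keep one marked-up copy for display, index a stripped copy for search.
import re

def searchable_text(marked_up):
    """Remove the single/double-asterisk formatting markers before indexing."""
    return re.sub(r"\*+", "", marked_up)

verse = {
    "ref": "John 3:16",
    "display": "For God so loved the world, that he gave his **only begotten Son**, "
               "that whosoever believeth in him should not perish, but have *everlasting life*.",
}
verse["plain"] = searchable_text(verse["display"])
print(verse["plain"])  # this is what gets indexed / searched
```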
A word of caution - if you're not familiar with search indexing, even something designed to be plug-and-play like Elastic Search will probably still require some time and effort to set up, so if you absolutely need to get this up and running quickly and you're already familiar with MySQL I suppose it could work (it does do fulltext search). But it's certainly not the best tool for the job, so if this is a project that you're invested in you will thank yourself later if you put in a little bit of work to learn one of these search frameworks. It may be overkill in terms of the amount of text you're dealing with, as others have pointed out, but it will be extremely flexible in how you can search on that text which seems to be what you want. For instance, adding other requirements later on would be very straightforward (for instance, you could let people limit their search to only matches in the red text).
I didn't know the Bible had formatting. What is it used for? If it is for the verses, I'd suggest you store every verse in a database. In a highly normalized form, you'd have a table with books, a table with chapters, and a table with verses. Each verse consists of a verse number and a verse text.
Now, I think the chapters don't have titles, so they are really just a number as well. In that case it is silly to store them separately, so you'd have just your table of books and a table of verses, in which each verse has a chapter number, a verse number, and a verse text. That text, I assume, is plain text, isn't it?
If the verse is plain text, you can easily make it searchable by storing it in MySQL and create a FULLTEXT index for it. That way, you can search quite efficiently and even use wildcards and such.
If the verse was to have formatting, you could choose to create two columns, one with the plain text for searching, and one with the formatted text for display, but I doubt you would need this.
PS: 5 MB of text is nothing really. If you got a dedicated program, you could keep it in memory in a single string and use strpos or a similar function to find a text. What language, database and platform are you using?
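
If you go the MySQL route described above, one possible two-column layout with a FULLTEXT index might look like this (connection details are placeholders, and the schema is only one way to lay it out):

```python
# Hedged sketch: plain_text is searched via FULLTEXT, marked_text is what you display.
import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(host="localhost", user="bible",
                               password="change-me", database="bible")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS verses (
        book        VARCHAR(40) NOT NULL,
        chapter     INT         NOT NULL,
        verse       INT         NOT NULL,
        plain_text  TEXT        NOT NULL,
        marked_text TEXT        NOT NULL,
        PRIMARY KEY (book, chapter, verse),
        FULLTEXT KEY ft_plain (plain_text)
    ) ENGINE=InnoDB
""")

cur.execute("""
    SELECT book, chapter, verse
    FROM verses
    WHERE MATCH(plain_text) AGAINST (%s IN NATURAL LANGUAGE MODE)
""", ("everlasting life",))
print(cur.fetchall())
```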

How to implement an Enterprise Search

We are searching disparate data sources in our company. We have information in multiple databases that need to be searched from our Intranet. Initial experiments with Full Text Search (FTS) proved disappointing. We've implemented a custom search engine that works very well for our purposes. However, we want to make sure we are doing "the right thing" and aren't missing any great tools that would make our job easier.
What we need:
Column search
ability to search by column
we flag which columns in a table are searchable
Keep some relation between db column and data
we provide advanced filtering on the results
facilitates (amazon style) filtering
filter provided by grouping of results and allowing user to filter them via a checkbox
this is a great feature, users like it very much
Partial Word Match
we have a lot of unique identifiers (product id, etc).
the unique id's can have sub parts with meaning (location, etc)
or only a portion may be available (when the user is searching)
or (by a decidedly poor design decision) there may be white space in the id
this is a major feature that we've implemented now via CHARINDEX (MSSQL) and INSTR (ORACLE)
using the char index functions turned out to give equivalent performance (+/-) on MSSQL compared to full text
didn't test on Oracle
however searches against both types of db are very fast
We take advantage of Indexed (MSSQL) and Materialized (Oracle) views to increase speed
this is a huge win, Oracle Materialized views are better than MSSQL Indexed views
both provide speedups in read-only join situations (like a search combining company and product)
A search that matches user expectations of the paradigm CTRL-f -> enter text -> find matches
Full Text Search is not the best in this area (slow and inconsistent matching)
partial matching (see "Partial Word Match")
Nice to have:
Search database in real time
skip the indexing step, this is not a hard requirement
Spelling suggestion
Xapian has this http://xapian.org/docs/spelling.html
Similar to google's "Did you mean:"
What we don't need:
We don't need to index documents
at this point searching our data sources is the most important thing
even when we do search documents, we will be looking for partial word matching, etc
Ranking
Our own simple ranking algorithm has proven much better than an FTS equivalent.
Users understand it, we understand it, it's almost always relevant.
Stemming
Just don't need to get [run|ran|running]
Advanced search operators
phrase matching, or/and, etc
according to Jakob Nielsen http://www.useit.com/alertbox/20010513.html
most users are using simple search phrases
very few use advanced searches (when it's available)
also in Information Architecture 3rd edition Page 185
"few users take advantage of them [advanced search functions]"
http://oreilly.com/catalog/9780596000356
our Amazon like filtering allows better filtering anyway (via user testing)
Full Text Search
We've found that results don't always "make sense" to the user
Searching with FTS is hard to tune (which set of operators matches the users' expectations?)
Advanced search operators are a no go
we don't need them because
users don't understand them
Performance has been very close (+/-) to the char index functions
but the results are sometimes just "weird"
The question:
Is there a solution that allows us to keep the key value pair "filtering feature", offers the column specific matching, partial word matching and the rest of the features, without the pain of full text search?
I'm open to any suggestion. I've wondered if a document/hash table nosql data store (MongoDB, et al) might be of use? ( http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo ). Any experience with these is appreciated.
Again, just making sure we aren't missing something with our in-house customized version. If there is something "off the shelf" I would be interested in it. Or if you've built something from some components, what components (search engines, data stores, etc) did you use and why?
You can also make your point for FTS. Just make sure it meets the requirements above before you say "just use Full Text Search because that's the only tool we have."
I ended up coding my own.
The results are fantastic. Users like it, it works well with our existing technologies.
It really wasn't that hard. Just took some time.
Features:
Faceted search (amazon, walmart, etc)
Partial word search (the real stuff not full text)
Search databases (oracle, sql server, etc) and non database sources
Integrates well with our existing environment
Maintains relations, so I can have an n-to-n search and display
--> this means I can display child records of a master record in search results
--> also I can search any child field and return the master record
It's really amazing what you can do with dictionaries and a lot of memory.
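
Not the poster's actual code, of course, but here is a toy of the "dictionaries and a lot of memory" idea with partial word matching over chosen columns (the rows and ids are invented):

```python
# In-memory per-column index; partial matches ignore case and stray whitespace.
rows = [
    {"id": "LOC01-4432 A", "company": "Acme", "product": "Widget Small"},
    {"id": "LOC02-9981",   "company": "Acme", "product": "Widget Large"},
    {"id": "LOC01-7755",   "company": "Bolt", "product": "Bracket"},
]

index = {}  # column -> normalized value -> set of row positions
for pos, row in enumerate(rows):
    for col, val in row.items():
        key = val.replace(" ", "").lower()
        index.setdefault(col, {}).setdefault(key, set()).add(pos)

def search(column, fragment):
    """Partial word match against one searchable column (CHARINDEX/INSTR style)."""
    fragment = fragment.replace(" ", "").lower()
    hits = set()
    for value, positions in index.get(column, {}).items():
        if fragment in value:
            hits |= positions
    return [rows[p] for p in sorted(hits)]

print(search("id", "loc01"))  # both LOC01 rows, despite the stray space in one id
```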
I recommend looking into Solr; I believe it will meet your needs:
http://lucene.apache.org/solr/
For an off-the-shelf solution: have you checked out the Google Search Appliance?
Quote from the Google Mini/GSA site:
... If direct database indexing is a requirement for you, we encourage you to consider the Google Search Appliance, which has direct database connectivity.
And of course it indexes everything else in the Googly manner you'd expect it to.
Apache Solr is a good way to start your project with, and it is open source. You can also try Elastic Search, and there are a lot of off-the-shelf products that offer good customization abilities and search features, such as Coveo, SharePoint Fast, Google...

Resources