Hit Highlighting with SQL Server 2008 FTS

This question was asked here before but got no answer, so I'm trying one more time - how do you do hit highlighting of results with SQL Server 2008 FTS?
So far I have found SQLHighlighter, but it is a commercial product. I also tried the solution described in this book http://apress.com/book/view/9781430215943 but performance was extremely poor. As a last resort I tried Lucene.Net's Highlighter, but it is tied to Lucene (which I'm trying to get away from).
Can someone recommend another way?

AFAIK it is a listed 'bug' of FTS that it does not return any hit-tracking information.
You are left with parsing the query yourself, matching it against each result row's text columns, and doing the highlighting yourself.
This could be quite simple or very hard depending on how you are building the FTS queries.
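For simple AND/OR queries, that approach can be sketched roughly like this (Python for brevity; the term list and tag name are illustrative - you would extract the terms from your own CONTAINS query string):

```python
import re

def highlight(text, terms, tag="b"):
    # Wrap every occurrence of a search term in <tag>...</tag>.
    # Longer terms go first, so "full text" wins over a bare "text".
    alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    pattern = re.compile(alternation, re.IGNORECASE)
    return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)
```

For FREETEXT-style queries you would also need to expand inflectional forms before matching, which is where this do-it-yourself approach starts getting hard.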

Related

Which database for good, standard full-text searching abilities

I've been using MySQL happily for many years but have now come across an issue using it and wondered where I should go from here. My issue is full-text indexing. MySQL just doesn't perform well using this feature with large tables, unless you use a third-party plugin like Lucene etc.
I don't mind paying for the database but would prefer a free service. I don't have a DB administration team that can maintain it, it's just me, so it has to be simple to maintain, develop and scale. I develop on a Windows IIS7 environment, usually in ASP.NET and Classic ASP. My application will probably have a maximum of 10 million rows in the full-text table, so not huge but fairly hefty.
I could quite easily grab Lucene and use that with MySQL, but I would really like to know which DB performs best using full-text indexing, straight out of the box, so-to-speak?
Any suggestions or experiences would be marvelous.
Can't help a lot other than saying that full text search on SQL Server is amazing. It seems complicated at first (because of catalogs, indexes and everything) but once you give it a go you'll see that it's quite simple to implement. This website shows an example with screens.
You also have several functions to manipulate (search) the data (thesaurus, stoplists, etc.).
Postgres.
In addition, there are Ruby gems to take advantage of it easily.
I worked at a place that used Oracle for full-text search, and they were happy with that until they found Lucene -- now they are switching to Lucene.
I've heard good things about Postgres' full-text search, but I've never seen it in action.
Lucene.NET is a straight .NET port of Lucene, and performs well.

SQL Server and regular expressions

Which of the following options will be better?
Regular expressions in SQL Server searches, since new versions support CLR objects?
Full text search?
Lucene.net combined with SQL Server?
Our database will have millions and millions of records and we will provide a Google-like search option, but unlike Google, which searches anything, we will be searching only some specific categories.
Please help.
Regular expressions are fine as long as your data is small, very small.
Full text search with SQL Server is a good choice. I personally do not like this option because the search syntax isn't as expressive as Lucene.Net's, but either way it is a good way to quickly get some full text search going without going into a lot of detail.
Lucene.Net gives you more control over (and responsibility for) creating and maintaining the index, so if this doesn't scare you away then Lucene.Net gives you high-quality results and you can do a lot with it. You can customize and tweak just about everything to get your search engine working the way you want it to. I would personally choose Lucene.Net.
In short:
Don't use regular expressions.
SQL Server Full Text Search is a quick and easy way to get a decent search, without being too technical.
Lucene.Net is the best for its quality of results, but requires you to go through some learning (if you're new).
For searching large amounts of data, you want a full text index. Regular expressions are more flexible and give your users more power to express their queries, but they will be slower.
Lucene is a fine choice, but you might find that the built-in features that SQL Server has already meet your needs.
Despite being a fan of all things SQL Server, I would favor lucene.net over SQL Server's Full Text Search.

"sounds-like", "did you mean THAT" functionality using full text search in SQL Server 2005

I have implemented full text search over SQL Server 2005 database using CONTAINSTABLE keyword.
I was wondering whether there is a way to add a "sounds like" or Google-style "did you mean THAT" functionality when the original query yields no results.
The SOUNDEX support in SQL Server is very limited and frustrating; I really recommend you take a look at Lucene.Net http://incubator.apache.org/lucene.net/. Lucene is a high-performance, full-featured text search engine library, and it is also very easy to use in .NET projects. If you need a serious search engine for your app, go with Lucene.
Some features retrieved from http://lucene.apache.org/java/docs/features.html:
ranked searching: best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
SQL Server has the functions SOUNDEX and DIFFERENCE
This related SO answer might be useful: How to make a sql search query more powerful?
If you want to be able to do this you need to normalize the raw text and the queries. Simple example, if you want to be able to search on a SOUNDEX type of value, you'll need to SOUNDEX both the query string and the original raw data that you're querying. You can't efficiently process the query space on the fly, so instead you normalize it during the creation of the index.
Technically, you need only normalize the actual index, not the data, but since your data likely IS your index, it will need to be normalized.
This is the same process as "stemming" of words, removing plurals, etc.
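As an illustration of that normalization step, here is a sketch of the classic American Soundex algorithm (the same family of codes that SQL Server's SOUNDEX function produces - this is a Python sketch for illustration, not SQL Server's exact implementation):

```python
def soundex(word: str) -> str:
    # Classic American Soundex: keep the first letter, then up to three digits.
    codes = {c: str(d) for d, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}
    word = word.lower()
    result, prev = word[0].upper(), codes.get(word[0])
    for c in word[1:]:
        if c in "hw":
            continue  # h and w do not separate equal codes
        digit = codes.get(c)  # vowels get no digit and reset prev
        if digit is not None and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]

# Normalize BOTH the indexed data and the query, then compare the codes:
# soundex("Smith") == soundex("Smyth")  ->  a "sounds like" hit
```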

Is SQL Server's Full Text Search the right tool for searching phrases, not documents?

30 million distinct phrases, not documents, ranging from one word to a 10-word sentence, and I need to support word/phrase searching. Basically what WHERE CONTAINS(phrase, '"book" OR "stack overflow"') offers.
I have an instance of SQL Server 2005 (32 bit, 4 proc, 4gb) going against several full text catalogs and performance is awful for word searches with high cardinality.
Here are my thoughts to speed things up, perhaps someone can offer guidance--
1) Upgrade to 2008 iFTS, 64-bit. SQL Server 2005 FTS's Windows service never uses more than 50MB. From what I have gathered, it uses the file system cache for looking up catalog indexes. My populated catalogs on disk are only around 300MB, so why can't this all be in memory? Might iFTS's new memory architecture, which is part of the SQL Server process, help here?
2) Scale out the catalogs to several servers. Will the queries to the linked FTS servers run in parallel?
3) Since I'm searching phrases here and not documents, maybe Sql Server's Full Text Search isn't the answer. Lucene.NET? Put the catalog index on a ram drive?
Lucene.Net can offer very high performance for this kind of application along with a pretty simple API. Release 2.3.2 is nearing completion, which offers additional performance increases over release 2.1. While putting the Lucene index in a RAMDirectory (Lucene's memory-based index structure) will offer even better performance, we see great results even with the FSDirectory (a disk-based index).
I'm slightly surprised that FTS is creaking under this sort of load. However, if this proves to be the case, then the classic approach (Gary Kildall developed it for searching CDs!) would be to use an inversion index. I've used this technique for a long time with a succession of applications. It is usually called the 'Inverted' or 'Inversion' index technique (see http://en.wikipedia.org/wiki/Search_engine_indexing#Inverted_indices ). The technique scales very well and I've tested it indexing up to 8 million documents. Even when searching through eight million documents, it gets results within three seconds if the indexes are right. Often it is a lot quicker than this.
I use an inversion index to get a pool of likely candidates (capped at a bearable number via TOP x), and then do a brute-force search of these with a regex. It works very well.
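A minimal sketch of that candidate-pool-plus-regex technique (Python, with a plain dict standing in for the inversion index; all names are illustrative):

```python
import re
from collections import defaultdict

def build_index(docs):
    # docs: {doc_id: text}. Inversion index: token -> set of doc ids.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index

def search_phrase(docs, index, phrase, top=1000):
    # Pool of likely candidates: docs containing every token (the TOP x step)...
    tokens = re.findall(r"\w+", phrase.lower())
    if not tokens:
        return []
    pool = sorted(set.intersection(*(index[t] for t in tokens)))[:top]
    # ...then brute-force the exact phrase over the (small) pool with a regex.
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return [d for d in pool if pattern.search(docs[d])]
```

The pool step does the heavy lifting; the regex only ever sees a handful of rows, which is why this stays fast even over millions of phrases.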
As an out-of-the-box solution, I would prefer using "Microsoft Office SharePoint Server" for indexing and searching within the content of documents.
A free alternative is Lucene.Net library if you want to write your own service for indexing and searching. Writing your own full-text search service with Lucene.Net will give you all the flexibility you need (yes you can store the index on an external storage if you want to).
Take a look at Apache Solr. It's a search server that wraps Lucene with an HTTP interface. Each of your phrases would map to a Solr document. 30M documents is not a lot for Solr since your documents would be very short. The final performance would also depend on how many queries/sec you need.

Configure Lucene.Net with SQL Server

Has anyone used Lucene.NET rather than using the full text search that comes with sql server?
If so I would be interested on how you implemented it.
Did you for example write a windows service that queried the database every hour then saved the results to the lucene.net index?
Yes, I've used it for exactly what you are describing. We had two services - one for read, and one for write, but only because we had multiple readers. I'm sure we could have done it with just one service (the writer) and embedded the reader in the web app and services.
I've used Lucene.Net as a general database indexer, so what I got back was basically DB IDs (of indexed email messages), and I've also used it to bring back enough info to populate search results or suchlike without touching the database. It worked great in both cases, though the SQL side can get a little slow, as you pretty much have to get an ID, select on that ID, etc. We got around this by making a temp table (with just the ID column in it) and bulk-inserting from a file (which was the output from Lucene), then joining to the message table. It was a lot quicker.
Lucene isn't perfect, and you do have to think a little outside the relational database box, because it TOTALLY isn't one, but it's very very good at what it does. Worth a look, and, I'm told, doesn't have the "oops, sorry, you need to rebuild your index again" problems that MS SQL's FTI does.
BTW, we were dealing with 20-50million emails (and around 1 million unique attachments), totaling about 20GB of lucene index I think, and 250+GB of SQL database + attachments.
Performance was fantastic, to say the least - just make sure you think about, and tweak, your merge factors (when it merges index segments). There is no issue in having more than one segment, but there can be a BIG problem if you try to merge two segments which have 1 million items in each, and you have a watcher thread which kills the process if it takes too long... (yes, that kicked our arse for a while). So keep the max number of documents per segment LOW (i.e., don't set it to maxint like we did!)
EDIT Corey Trager documented how to use Lucene.NET in BugTracker.NET here.
I have not done it against a database yet; your question is kinda open-ended.
If you want to search a DB, and can choose to use Lucene, I also guess that you can control when data is inserted into the database.
If so, there is little reason to poll the DB to find out whether you need to reindex; just index as you insert, or create a queue table which can be used to tell Lucene what to index.
I think we don't need another indexer that is ignorant of what it is doing, reindexing every time, or wasting resources.
I have used Lucene.Net also as a storage engine, because it's easier to distribute and set up alternate machines with an index than with a database; it's just a filesystem copy - you can index on one machine and just copy the new files to the other machines to distribute the index. All the searches and details are served from the Lucene index, and the database is just used for editing. This setup has proven to be a very scalable solution for our needs.
Regarding the differences between SQL Server and Lucene, the principal problem with SQL Server 2005 full text search is that the service is decoupled from the relational engine, so joins, ordering, aggregates and filters between the full text results and the relational columns are very expensive in performance terms. Microsoft claims that these issues have been addressed in SQL Server 2008 by integrating the full text search into the relational engine, but I haven't tested it. They also made the whole full text search much more transparent; in previous versions the stemmers, stopwords, and several other parts of the indexing were like a black box and difficult to understand, whereas in the new version it is easier to see how they work.
In my experience, if SQL Server meets your requirements, it will be the easiest way; if you expect a lot of growth, complex queries, or need a lot of control over the full text search, you might consider working with Lucene from the start because it will be easier to scale and personalise.
I used Lucene.NET along with MySQL. My approach was to store the primary key of the DB record in the Lucene document along with the indexed text. In pseudocode it looks like:
Storing a record:
insert the text and other data into the table
get the latest inserted ID
create a Lucene document
put (ID, text) into the Lucene document
update the Lucene index
Querying:
search the Lucene index
for each Lucene doc in the result set, load the data from the DB by the stored record's ID
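That flow can be sketched like so (Python, with sqlite3 and an in-memory dict standing in for MySQL and the Lucene index respectively; all names are illustrative):

```python
import re
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
index = defaultdict(set)  # token -> row ids; stand-in for the Lucene index

def store(body):
    cur = conn.execute("INSERT INTO records (body) VALUES (?)", (body,))
    row_id = cur.lastrowid                      # get the latest inserted ID
    for token in re.findall(r"\w+", body.lower()):
        index[token].add(row_id)                # put (ID, text) into the index
    return row_id

def query(term):
    ids = sorted(index.get(term.lower(), ()))   # search the index first
    if not ids:
        return []
    marks = ",".join("?" * len(ids))            # then load rows from the DB by ID
    sql = f"SELECT id, body FROM records WHERE id IN ({marks}) ORDER BY id"
    return conn.execute(sql, ids).fetchall()
```

The key design point is that the index stores only the primary key plus searchable text, so the database remains the single source of truth for everything else.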
Just to note, I switched from Lucene to Sphinx due to its superb performance.
