SQL Server and regular expressions - sql-server

Which of the following options would be better?
Regular expressions in SQL Server searches, now that newer versions support CLR objects?
Full-text search?
Lucene.net combined with SQL Server?
Our database will have millions and millions of records, and we will provide a Google-like search option. Unlike Google, which searches for anything, we will search only some specific categories.
Please help.

Regular expressions are fine as long as your data is small, very small.
Full-text search with SQL Server is a good choice. I personally do not like this option because the search syntax isn't as expressive as Lucene.Net's. Either way, it is a good way to get some full-text search going quickly, without going into a lot of detail.
Lucene.Net gives you more control over, and more responsibility for, creating and maintaining the index. If that doesn't scare you away, Lucene.Net gives you high-quality results and you can do a lot with it. You can customize and tweak just about everything to get your search engine working the way you want it to. I would personally choose Lucene.Net.
In short:
Don't use regular expressions.
SQL Server Full Text Search is a quick and easy way to get a decent search without being too technical.
Lucene.Net is the best for its quality of results, but requires you to go through some learning (if you're new to it).

For searching large amounts of data, you want a full-text index. Regular expressions are more flexible and give your users more power to express their queries, but they will be slower.
Lucene is a fine choice, but you might find that the built-in features of SQL Server already meet your needs.

Despite being a fan of all things SQL Server, I would favor lucene.net over SQL Server's Full Text Search.

Related

Which database for good, standard full-text searching abilities

I've been using MySQL happily for many years but have now come across an issue using it and wondered where I should go from here. My issue is full-text indexing. MySQL just doesn't perform well using this feature with large tables, unless you use a third-party plugin like Lucene etc.
I don't mind paying for the database but would prefer a free one. I don't have a DB administration team to maintain it; it's just me, so it has to be simple to maintain, develop and scale. I develop in a Windows IIS7 environment, usually in ASP.NET and Classic ASP. My application will probably have a maximum of 10 million rows in the full-text table, so not huge but fairly hefty.
I could quite easily grab Lucene and use that with MySQL, but I would really like to know which DB performs best using full-text indexing straight out of the box, so to speak.
Any suggestions or experiences would be marvelous.
I can't help much other than to say that full-text search on SQL Server is amazing. It seems complicated at first (because of catalogs, indexes and everything), but once you give it a go you'll see that it's quite simple to implement. This website shows an example with screenshots.
You also have several features for manipulating (searching) the data (thesaurus, stoplists, etc.).
Postgres.
In addition, there are Ruby gems to take advantage of it easily.
I worked at a place that used Oracle for full-text search, and they were happy with that until they found Lucene -- now they are switching to Lucene.
I've heard good things about Postgres' full-text search, but I've never seen it in action.
Lucene.NET is a straight .NET port of Lucene, and performs well.

Is it advisable to use Lucene for this?

I have a huge XML file, about 2GB in size, containing Resumes. There are thousands of resumes in this file, tagged properly. Right now I am using XPATH to query it. So is it advisable to use Lucene for the same instead of XPATH?
It depends on what your requirements are. If you need full-text searching and all the other great features of a full-blown search engine, Lucene is the way to go. I would recommend Solr, which builds on top of Lucene and provides a much better API and abstraction.
Like everything else technology related, it depends.
What Lucene gives you that you're not getting with XPath is the power of a full-text engine that supports, among other things, ranking, phrase queries and wildcard queries.
Based on your use case I would say that a full-text search engine makes sense. That's not to say that vanilla Lucene is the best way to go (there are, for example, other alternatives that build on Lucene).
2 GB seems small enough that I would construct my own (minimal) inverted index :) That said, there is no problem with using Lucene/Solr, so go ahead. It will help you once your record count starts doubling. However, at this scale (2 GB), or even much larger, many real-life systems run on database full-text searches using the SQL LIKE keyword.
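As a rough illustration of the "minimal inverted index" idea mentioned above, here is a small Python sketch. The function names (`tokenize`, `build_index`, `search`) and the sample resumes are made up for the example, and the tokenizer is deliberately naive; it is not how Lucene does it, just the basic token-to-document mapping.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample data standing in for the parsed resumes.
resumes = {
    1: "Senior Java developer with Lucene experience",
    2: "SQL Server DBA, full-text search and indexing",
    3: "Java and SQL developer",
}
index = build_index(resumes)
print(search(index, "java developer"))
```

There is no ranking, stemming or phrase matching here, which is exactly what a library such as Lucene adds on top.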

Data Correlation in large Databases

We're trying to identify the locations of certain information stored across our enterprise in order to bring it into compliance with our data policies. On the file end, we're using Nessus to search through differing files, but I'm wondering about on the database end.
Using Nessus would seem largely pointless because it would output the raw data and wouldn't tell us what table or row it was in, or give us much useful information, especially considering these databases are quite large (hundreds of gigabytes).
Also worth noting, this system needs to be able to do pattern-based matching (such as using regular expressions). Not just a "dumb search" engine.
I've investigated the use of Data Mining and Data Warehousing in order to find this data but it seems like they're more for analysis of data than actually just finding data.
Is there a better method of searching through large amounts of data in a database to try to find this information? We're using both Oracle 11g and SQL Server 2008 and need to perform the searches on both, so I'd like to stay away from server-specific paradigms (although if I have to rewrite some code to translate from T-SQL to PL/SQL, and vice versa, I don't mind).
On SQL Server for searching through large amounts of text, you can look into Full Text Search.
Read more here http://msdn.microsoft.com/en-us/library/ms142559.aspx
But if I am reading right, you want to spider your database in a similar fashion to how a web search engine spiders web sites and web pages.
You could use a set of full text queries that bring back the results spanning multiple tables.
Oracle supports regular expressions with the REGEXP_LIKE() function, and it ought to be fairly straightforward to automate the generation of the code you need based on system metadata (to find all text columns over a certain length, for example, and include them in a predicate against that table to find the rows and values that match your regexp). It doesn't sound too challenging, really. In theory you could add check constraints to columns to prevent the insertion of values that match a regexp, but that might be overkill.
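A rough sketch of that code generation, in Python: given a list of (table, column) pairs, it emits one query per table with a REGEXP_LIKE predicate on every text column. The function name `build_regex_queries`, the sample table and column names, and the `:pattern` bind variable are assumptions for illustration; in practice the pairs would come from the data dictionary, and the identifiers are assumed to be trusted because they are not quoted or escaped here.

```python
from collections import defaultdict

def build_regex_queries(columns):
    """Build one Oracle query per table from (table_name, column_name) pairs.

    Each query ORs together a REGEXP_LIKE test on every listed column and
    leaves the pattern as a bind variable named :pattern.
    """
    by_table = defaultdict(list)
    for table, column in columns:
        by_table[table].append(column)

    queries = []
    for table in sorted(by_table):
        predicate = " OR ".join(
            f"REGEXP_LIKE({column}, :pattern)" for column in by_table[table]
        )
        queries.append(f"SELECT * FROM {table} WHERE {predicate}")
    return queries

# Hypothetical metadata; real input would be read from the system catalog.
text_columns = [
    ("CUSTOMERS", "NOTES"),
    ("CUSTOMERS", "EMAIL"),
    ("ORDERS", "COMMENTS"),
]
for query in build_regex_queries(text_columns):
    print(query)
```

The generated SQL is Oracle-specific. SQL Server 2008 has no built-in equivalent of REGEXP_LIKE, so the predicate would have to be swapped for something else there, such as a CLR function.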
Oracle Text is suited to searching for words/phrases in large(ish) bits of text (e.g. PDFs, HTML, TXT or DOC files) held in the database. There is some limited fuzzy searching, but not regular expressions per se.
You don't really go into what sort of data you are looking for or what you have in your databases. Nessus indicates you are looking for security issues, but the title of "Data Correlation" suggests something completely different.
Really the data structures should provide the information about what to look for and where. That's what databases are about - structuring data for accessibility. A database backing a CMS, forum software or similar would be a different kettle of fish.

"sounds-like", "did you mean THAT" functionality using full text search in SQL Server 2005

I have implemented full-text search over a SQL Server 2005 database using the CONTAINSTABLE keyword.
I was wondering whether there is a way to add a "sounds like" or Google-style "did you mean THAT" functionality if the original query yields no results.
The soundex for SQL Server is very limited and frustrating. I really recommend you take a look at Lucene.net http://incubator.apache.org/lucene.net/. Lucene is a high-performance, full-featured text search engine library, and it is also very easy to use in .NET projects. If you need a serious search engine for your app, go with Lucene.
Some features retrieved from http://lucene.apache.org/java/docs/features.html:
ranked searching, best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
SQL Server has the functions SOUNDEX and DIFFERENCE.
This related SO answer might be useful: How to make a sql search query more powerful?
If you want to be able to do this you need to normalize the raw text and the queries. Simple example, if you want to be able to search on a SOUNDEX type of value, you'll need to SOUNDEX both the query string and the original raw data that you're querying. You can't efficiently process the query space on the fly, so instead you normalize it during the creation of the index.
Technically, you need only normalize the actual index, not the data, but since your data likely IS your index, it will need to be normalized.
This is the same process as "stemming" of words, removing plurals, etc.
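To make the "normalize at index time" point concrete, here is a small Python sketch that builds a phonetic index keyed by a Soundex code and looks up suggestions with the same normalization applied to the query. The `soundex` function below implements the classic American Soundex rules and is not guaranteed to match the output of SQL Server's SOUNDEX in every case; `build_phonetic_index`, `did_you_mean` and the vocabulary are hypothetical names and data for the example.

```python
from collections import defaultdict

# Consonant groups and their Soundex digits; vowels, y, h and w have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character classic American Soundex code of a word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]

def build_phonetic_index(words):
    """Normalize every indexed word once, when the index is created."""
    index = defaultdict(set)
    for word in words:
        index[soundex(word)].add(word)
    return index

def did_you_mean(index, query):
    """Normalize the query the same way and return similar-sounding words."""
    return sorted(index.get(soundex(query), set()) - {query})

vocabulary = ["Robert", "Rupert", "Smith", "Smythe", "Lucene"]
index = build_phonetic_index(vocabulary)
print(did_you_mean(index, "Robbert"))
```

The lookup is a single dictionary access because the expensive work was done while building the index, which is the point the answer above is making.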

Hit Highlighting with SQL Server 2008 FTS

This question was asked here already but got no answer, so I'm trying one more time: how do I do hit highlighting of results with SQL Server 2008 FTS?
So far I have found SQLHighlighter, but it is a commercial product. I also tried the solution described in this book http://apress.com/book/view/9781430215943 but performance was extremely poor. As a last resort I tried the Lucene.Net Highlighter, but it is tied to Lucene (which I'm trying to get away from).
Can someone recommend another way?
AFAIK it is a listed 'bug' of FTS that it does not return any hit-tracking information.
You are left with parsing the query yourself, matching it against the text columns of each result row, and doing the highlighting.
This could be quite simple or very hard, depending on how you are building the FTS queries.
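For the simple end of that range, a sketch in Python of "parse the query yourself and highlight" could look like the following. It assumes the query is just a list of words joined by a few boolean operators; the function name `highlight`, the operator list and the `<b>` markers are choices made for this example, not anything FTS provides.

```python
import re

# Operator words that should not be highlighted (an assumed, partial list).
OPERATORS = {"AND", "OR", "NOT", "NEAR"}

def highlight(text, query, before="<b>", after="</b>"):
    """Wrap whole-word, case-insensitive matches of the query terms in text."""
    terms = [t for t in re.findall(r"\w+", query) if t.upper() not in OPERATORS]
    if not terms:
        return text
    # Longest terms first so that longer alternatives are tried before shorter ones.
    terms.sort(key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: before + m.group(0) + after, text)

print(highlight("Full text search in SQL Server", "sql AND search"))
```

This only marks exact word matches. If the FTS query uses inflectional forms, prefix terms or a thesaurus, the rows returned may contain hits that this sketch does not highlight, which is where the "very hard" part begins.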
