SQL Server 2008 Full Text Search (FTS) versus Lucene.NET - sql-server

I know there have been questions in the past about SQL 2005 versus Lucene.NET, but since 2008 came out with a lot of changes to FTS, I was wondering if anyone can give me pros/cons (or link to an article).

SQL Server FTS is going to be easier to manage for a small deployment. Since FTS is integrated with the DB, the RDBMS handles updating the index automatically. The con here is that you don't have an obvious scaling solution short of replicating DBs. So if you don't need to scale, SQL Server FTS is probably "safer". Politically, most shops are going to be more comfortable with a pure SQL Server solution.
On the Lucene side, I would favor SOLR over straight-up Lucene. With either solution you have to do more work yourself: updating the index when the data changes, and mapping your data to the SOLR/Lucene index. The pros are that you can easily scale by adding additional indexes. You could run these indexes on very lean Linux servers, which eliminates some license costs. If you take the Lucene/SOLR route, I would aim to put ALL the data you need directly into the index, rather than putting pointers back to the DB in the index. You can include data in the index that is not searchable, so for example you could have pre-built HTML or XML stored in the index and serve it up as a search result. With this approach your DB could be down and you would still be able to serve up search results in a disconnected mode.
I've never seen a head-to-head performance comparison between SQL Server 2008 and Lucene, but would love to see one.
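To make the "put ALL the data in the index" idea concrete, here is a minimal sketch in plain Python. It is not Lucene or SOLR code; the `TinyIndex` class and its method names are hypothetical and only illustrate the split between text that is tokenized for searching and a stored payload that is returned as-is, so a result can be served without going back to the DB.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: searchable text plus a stored-only payload (hypothetical)."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of doc ids
        self._stored = {}                  # doc id -> payload, never tokenized

    def add(self, doc_id, searchable_text, stored_payload):
        # Only the searchable text goes into the inverted index.
        for term in searchable_text.lower().split():
            self._postings[term].add(doc_id)
        # The payload is kept verbatim so results need no database lookup.
        self._stored[doc_id] = stored_payload

    def search(self, query):
        # AND semantics: every query term must appear in the document.
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return [self._stored[i] for i in sorted(ids)]


index = TinyIndex()
index.add(1, "full text search in SQL Server", "<li>FTS overview</li>")
index.add(2, "scaling search with Solr", "<li>Solr scaling notes</li>")

results = index.search("search solr")
print(results)
```

In a real Lucene/SOLR setup the same split is made per field (indexed versus stored); the sketch only shows why a stored payload lets search keep working when the database is unavailable.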

I built a medium-size knowledge base (maybe 2GB of indexed text) on top of SQL Server 2005's FTS in 2006, and have now moved it to 2008's iFTS. Both situations have worked well for me, but the move from 2005 to 2008 was actually an improvement for me.
My situation was NOT like StackOverflow's, in the sense that I was indexing data that was only refreshed nightly. However, I was trying to join search results from multiple CONTAINSTABLE statements back to each other and to relational tables.
In 2005's FTS, this meant each CONTAINSTABLE would have to execute its search on the index, return the full results and then have the DB engine join those results to the relational tables (this was all transparent to me, but it was happening and was expensive to the queries). 2008's iFTS improved this situation because the database integration allows the multiple CONTAINSTABLE results to become part of the query plan which made a lot of searches more efficient.
I think that both 2005 and 2008's FTS engines, as well as Lucene.NET, have architectural tradeoffs that are going to align better or worse to a lot of project circumstances - I just got lucky that the upgrade worked in my favor. I can completely see why 2008's iFTS wouldn't work in the same configuration as 2005's for the highly OLTP nature of a use case like StackOverflow.com. However, I would not discount the possibility that the 2008 iFTS could be isolated from the heavy insert transaction load... but it also sounds like it could be as much work to accomplish that as move to Lucene.NET ... and the cool factor of Lucene.NET is hard to ignore ;)
Anyway, for me, the ease and efficiency of SQL 2008's iFTS in the majority of situations probably edges out Lucene's 'cool' factor (though Lucene is easy to use, I've never used it in a production system so I'm reserving comment on that). I would be interested in knowing how much more efficient Lucene is (has turned out to be? is it implemented now?) in StackOverflow or similar situations.

This might help:
https://blog.stackoverflow.com/2008/11/sql-2008-full-text-search-problems/
Haven't used SQL Server 2008 personally, though based on that blog entry, it looks like the full-text search functionality is slower than it was in 2005.

We use both full-text search options, but in my opinion it depends on the data itself and your needs.
We scale with web servers, and therefore I like Lucene, because it keeps that load off the SQL Server.
If you are starting from scratch and want full-text search, I would prefer the SQL Server solution, because I think you get results really quickly. With Lucene you have to implement more at the start (and also build up some know-how).

One consideration that you need to keep in mind is what kind of search constraints you have in addition to the full-text constraint. If you need constraints that Lucene can't provide, then you will almost certainly want to use FTS. One of the nice things about 2008 is that they improved the integration of FTS with standard SQL Server queries, so performance should be better with mixed database and FT constraints than it was in 2005.

Related

Lucene.Net, SQL Server, NHibernate, ASP.NET MVC

I am using these technologies: SQL Server 2005, ASP.NET MVC, NHibernate/Sharp Architecture, and would like to mine some text with the final aim of presenting some web-based stats. I have several million keywords and several million documents, and would like to run some queries based on these documents indexed by the keywords. I have played a bit with SQL Server’s full text indexing but I am not too impressed. So I am wondering whether Lucene.Net might be an alternative.
I have never used Lucene.Net but understand that it is a 1:1 port of the Java version. So my first question is whether it is worth studying the book ‘Lucene in action’ – provided that Lucene would be the right ‘technology’?
Thanks.
Best wishes,
Christian
Well,
FIRST - update SQL Server. You are using a version that is two generations out of date, with an older implementation of full text search that has many (known and since fixed) shortcomings.
Second - Lucene may really be better suited. SQL is primarily a database server, and the full text search does a lot of things, but also has a lot of limitations.
But bringing in Lucene DOES add a significant complication - distributed transactions and backup handling get a lot more complicated because you now have two systems. SQL 2008 R2 does a much better job here (full text index stored in the database file).
That said, also be careful with performance. You may need a QUITE HIGH END SERVER if you want to run a lot of queries in parallel (which can happen easily with a web application). This may require multiple database servers running as read-only replicas - something SQL Server does a lot more easily than Lucene (as in: out of the box).
I suggest you just get Lucene and play with it ;) Not a lot more needed.

Third Party Tools for Monitoring SQL Server Performance

I'm in a situation where I came into a new job and I have to support several legacy systems. The original developer is no longer around. These legacy systems are really hammering away at our SQL Server and killing performance. I know that there are a lot of things that can be done in the code, but rewriting code is really my last resort.
What I'm looking for is some sort of tool that will monitor the queries coming into the server and give recommendations on indexing solutions. I know I can use the SQL Server Profiler but I'm looking for something a little more user friendly and something that can help me make the indexing decisions.
I know I didn't explain it very well, but I'm sure this is a common request. I'd like to make informed decisions on what to index and avoid "shooting from the hip" and indexing everything in sight. Thanks for any recommendations!
You don't need a third party tool for this.
Assuming SQL Server 2005+, you can use SQL Profiler to collect a representative workload (actually SQL Trace - don't use the Profiler GUI for this, to keep tracing overhead as low as possible). You can then use the Database Engine Tuning Advisor to automate analysis of that workload and make indexing recommendations.
You can also use the Missing Index DMVs for a quick overview of areas to investigate but the DTA will do more holistic analysis and take into account possible adverse effects of indexes on data modification statements.
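As a rough illustration of the "quick overview" use of the Missing Index DMVs, the sketch below ranks rows by a commonly used heuristic: `avg_total_user_cost * avg_user_impact * (user_seeks + user_scans)`. Those four column names come from `sys.dm_db_missing_index_group_stats`; everything else here (the `MissingIndexRow` class, the sample tables and numbers) is hypothetical, and the score is a community rule of thumb rather than an official formula. It also ignores the write-side cost that the DTA takes into account.

```python
from dataclasses import dataclass


@dataclass
class MissingIndexRow:
    # One row as you might export it from the missing index DMVs (sample shape only).
    table: str
    equality_columns: str
    user_seeks: int
    user_scans: int
    avg_total_user_cost: float
    avg_user_impact: float  # percent, 0-100


def estimated_benefit(row):
    # Heuristic score: query cost x expected improvement x how often it would be used.
    uses = row.user_seeks + row.user_scans
    return row.avg_total_user_cost * (row.avg_user_impact / 100.0) * uses


def rank(rows, top=3):
    # Highest estimated benefit first.
    return sorted(rows, key=estimated_benefit, reverse=True)[:top]


# Hypothetical sample data, not real DMV output.
rows = [
    MissingIndexRow("dbo.Orders", "[CustomerId]", 12000, 0, 4.5, 92.0),
    MissingIndexRow("dbo.AuditLog", "[CreatedAt]", 15, 2, 30.0, 99.0),
    MissingIndexRow("dbo.Person", "[Surname]", 800, 10, 1.2, 60.0),
]

ranked_tables = [r.table for r in rank(rows)]
print(ranked_tables)
```

A suggestion with a very high impact percentage but only a handful of seeks (the `dbo.AuditLog` row here) ends up at the bottom, which is the kind of triage that avoids indexing everything in sight.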
+1 for Martin's answer, but since you asked about 3rd party tools, I'll mention one of my favorites (and no, I don't work for the company). Ignite for SQL Server does an excellent job of analyzing server activity in terms of wait time analysis. It won't make recommendations for you, but it will quickly identify the worst performing queries where you need to focus your effort.
SQL Server 2005+ has a lot of DMVs (Dynamic Management Views) that you can query to get server info, as well as the Profiler / SQL Trace tool.
We administer several large database servers.
Idera is a good tool to manage multiple database servers easily.
I think you'd make a much better DBA if you learn more about the inbuilt functionality of SQL server.
Have a browse of
http://msdn.microsoft.com/en-us/library/ms188754.aspx
to find out more about DMVs and functions.
Another common cause of performance problems could be your indexes.
There's a great tutorial that combines the DMVs with improving indexes here:
http://searchsqlserver.techtarget.com/tip/Using-dynamic-management-views-to-improve-SQL-Server-index-effectiveness
Idera is really worth checking out though as a good starting point. Combined with DMVs & SQL Trace there shouldn't be much you won't be able to fix.
Idera just takes most of the leg work out of doing things.
http://www.idera.com/Content/Home.aspx
Idera: SQL Diagnostic Manager

Database tuning advice

Some of you may not even know about these features, in which case you will learn a lot from this post. Others probably use them on a daily basis, so you can help me and other users with less DBA experience to optimize better.
I'm using SQL Server 2005 Standard.
I run SQL Server Profiler a lot. Each time, I find ad hoc queries or SPs whose execution time exceeds my limits (roughly 100ms for complex queries and 30ms for short ones; the exact numbers don't matter, they are just for illustration). After I find possibly problematic queries I write them down so I can feed them to Database Engine Tuning Advisor, which runs them against the tables and recommends the indexes I need to build in order to improve performance. Each night I run the index rebuild task from Maintenance Plans.
Now question time!!!
1. If Database Engine Tuning Advisor gives me 10 indexes to create for an improvement of about 40%, should I follow its advice or not? A better question: what ratio of number of indexes to improvement percentage should I aim for? Indexes take space and time to rebuild.
2. If I create about 5-7 indexes for each problematic query, I can end up with 500 indexes per DB. How many indexes can I build before the DB stops performing normally? Are there any limitations?
3. Is there any other way to optimize (not re-design) a DB other than using my method or going SP by SP by hand and eye?
There's no right answer to this question as it depends heavily on your workload.
For workloads with a heavy ratio of reads (e.g. a data warehouse) it might make sense to create an index that would be positively counterproductive in an environment with a greater proportion of writes.
The DTA can help in this regard by assessing the impact on the overall workload, but you would need to try and capture a representative sample (not just the poorly performing queries). SQL Profiler is quite resource intensive, so to do this with the least possible impact on your server you would need to use a server-side SQL trace with appropriate filters to only log events related to the database of interest.
To identify the poorest performing queries in isolation: if you have at least the SQL2005 SP1 client tools installed, you should be able to right-click the database node in Management Studio and use the Reports -> Standard Reports menu to see the plans in the cache with the highest CPU/IO.
If you are interested in this area I recommend the book SQL Server 2008 Query Performance Tuning Distilled (most of it is applicable to SQL2005 as well).
You can get SQL Profiler to log to a table, so it will write the queries to a table you specify. If you can, leave it running for a few hours, or however long it takes to cover as many queries/events as possible.
Next, use Database Engine Tuning Advisor and get it to use this table of queries as its source input. You will find it looks at the whole pattern, and will recommend that you create some indexes and remove others.
This is better than looking at queries one by one in isolation, although that still has its place.

ORM and database indexes

What approach do you take towards creating and maintaining database indexes when using an ORM such as NHibernate/Hibernate?
Since the ORM is generating the queries, are there any tools you could recommend that could analyze query plans of those and suggest the kind of indexes that should be created?
My current approach is ... wait until something runs slowly and then find the slow query and optimize it ... but this is sort of lame, isn't it? My goal is not to end up with tens or hundreds of indexes where nobody knows which are actually being used by the system and which aren't. So again, it comes down to index maintenance.
My environment is NHibernate + SQL Server 2005.
I find that the columns that need indexing are typically "obvious". By that I mean if you create queries like "select p from Person p where p.surname = :surname" then whatever column surname refers to needs an index.
Likewise every foreign key should be indexed.
And no I don't wait until performance is actually a problem. Indexes are just something I do right from the start.
Oh the other thing I wanted to add was that most (if not all) ORMs have the ability to turn on statement logging. These often aren't particularly readable (single line, table names of t0, t1, t2, etc) but this could tell you what queries were run and how often.
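The "index the column behind the WHERE clause" rule is easy to check against a query plan. The sketch below uses SQLite through Python's standard `sqlite3` module purely because it runs anywhere; it is a stand-in, not SQL Server, and the `person` table and `ix_person_surname` index are made-up names. On SQL Server you would look at the execution plan in Management Studio instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, surname TEXT, first_name TEXT)"
)
conn.executemany(
    "INSERT INTO person (surname, first_name) VALUES (?, ?)",
    [("Smith", "Ann"), ("Jones", "Bob"), ("Smith", "Cy")],
)

# The kind of statement an ORM would generate for "where p.surname = :surname".
QUERY = "SELECT id, first_name FROM person WHERE surname = ?"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("Smith",)).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX ix_person_surname ON person (surname)")
plan_after = plan(QUERY)   # the new index is used for the lookup

print(plan_before)
print(plan_after)
```

The exact plan wording differs between SQLite versions, so the sketch only relies on the plan switching from a scan to a search that names the index.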
The standard tools you would use to analyse slow queries / poor indexing apply whether or not you are using an ORM. You can use SQL Server Profiler to examine the SQL statements that are running against your database, and then use the execution plan features in the query window in SQL Server Management Studio / Query Analyzer to see the details of your query plans and get an idea of which indexes you may need to add.
You can also use the Database Engine Tuning Advisor in SQL Server Management Studio, although whether or not that tool is actually more useful than simply spending some time thinking about your database design and querying patterns is open to question.

Sql Server 2005 efficiency savings?

Are there good efficiency savings from using Sql Server 2005 over Sql Server 2000?
Or does it just have more services, etc.?
Has anyone seen their system work any quicker after making the upgrade?
The surrounding tools such as Analysis Services were substantially rewritten and can get you a variety of wins depending on your requirements. However I don't see a lot of really fundamental changes from 2000 to 2005 in the core database engine.
There are some improvements that may get you better performance in certain situations. SQL2005 has much better support for 64-bit architectures and better table partitioning than SQL2000 (you can partition a table as opposed to making partitioned views). 64-bit support is the most likely to give you a performance win on a large system as it allows you to set up much larger caches.
Apart from those features I don't believe that there is really a large difference. There are probably minor performance tweaks.
The main reason to move from SQL2000 to SQL2005 will be when SQL2000 goes out of support. If you have a running application on SQL2000 there are not a lot of compelling reasons to switch to 2005 while 2000 is still supported by Microsoft.
Data Warehouse systems will get quite a few wins from moving to SQL2005. SSIS, SSAS2005 and SSRS2005 are much better than their SQL2000 counterparts.
2005 provides MVCC - row level versioning essentially - so as a developer there are some efficiencies: less locking to worry about.
I haven't migrated a system from 2000 to 2005 - I've either started with one or the other - so I don't have a comparison of my own. But there is a reasonable chance you will see a perf difference; if not by taking advantage of some of the new features like snapshot isolation, then at least by virtue of the fact that SQL2005's licensing model allows you to go multi-core at no additional licensing cost, and by the fact that SQL2005 has improved memory management.
Things will absolutely run faster with 2005. There were several improvements made to the query optimizer. And now you can create covering indexes so that the included columns only exist at the leaf level and don't have to get sorted. That alone is an enormous improvement and reason enough to upgrade.
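For readers who have not met covering indexes: in SQL Server 2005 they are written as `CREATE NONCLUSTERED INDEX ... ON ... (key columns) INCLUDE (other columns)`. The sketch below shows the underlying effect, a query answered from the index alone, using SQLite via Python's `sqlite3` module as a runnable stand-in. SQLite has no `INCLUDE` clause, so the extra column is added as a second key column here (which means it is sorted, unlike a true included column). The `orders` table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, notes TEXT)"
)
# Composite index standing in for: ... ON orders (customer_id) INCLUDE (status)
conn.execute("CREATE INDEX ix_orders_customer ON orders (customer_id, status)")


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Every column this query needs lives in the index, so the table is never touched.
covered = plan("SELECT status FROM orders WHERE customer_id = 42")

# 'notes' is not in the index, so each match needs an extra lookup into the table.
not_covered = plan("SELECT notes FROM orders WHERE customer_id = 42")

print(covered)
print(not_covered)
```

The second query still uses the index to find rows, but has to go back to the table for `notes`; avoiding that extra lookup is the saving a covering index gives you.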
SQL 2005 does a better job of working with caching. You used to have to poll SQL 2000 periodically to check for updates to a whole table. Now you can subscribe to a notification when something changes. It also works for queries, tables, and a few other elements.
I would say yes for all of the reasons listed by others, but even if your SQL skills are not that strong and your queries are not that great they will probably run faster on 2005. We moved from 2000 to 2005 and we had some complex queries that we could not get properly optimized in 2000. When we moved to 2005 it ate the queries up! Clearly the optimizer was making much better decisions out of the box.
I would strongly recommend moving to 2005 unless you have no issues with 2000.
