Lucene.Net, SQL Server, NHibernate, ASP.NET MVC

I am using these technologies: SQL Server 2005, ASP.NET MVC, and NHibernate/Sharp Architecture. I would like to mine some text, with the final aim of presenting some web-based stats. I have several million keywords and several million documents, and would like to run queries over these documents indexed by the keywords. I have played a bit with SQL Server's full-text indexing, but I am not too impressed. So I am wondering whether Lucene.Net might be an alternative.
I have never used Lucene.Net, but I understand that it is a 1:1 port of the Java version. So my first question is whether it is worth studying the book 'Lucene in Action', provided that Lucene is the right 'technology' in the first place?
Thanks.
Best wishes,
Christian

Well,
First - update SQL Server. You are using a version that is two generations out of date: it had the first implementation of full-text search in SQL Server and many (known and since fixed) shortcomings.
Second - Lucene may really be better suited. SQL Server is primarily a database server; its full-text search does a lot of things, but it also has a lot of limitations.
But bringing in Lucene DOES add significant complication - distributed transactions and backup handling get a lot more complicated because they are two separate systems. SQL Server 2008 R2 does a much better job here (the full-text index is stored in the database file).
That said, also be careful with performance. You may need a QUITE HIGH-END SERVER if you want to run a lot of queries in parallel (which can easily happen with a web application). This may require multiple database servers running read-only replicas - something SQL Server makes a lot easier than Lucene (as in: out of the box).
I suggest you just get Lucene and play with it ;) Not a lot more is needed.
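If you do play with it, indexing and querying take only a few lines. Here is a minimal sketch, assuming Lucene.Net 3.x; the index path and the field names ("id", "body") are invented for illustration:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class LucenePlayground
{
    static void Main()
    {
        var dir = FSDirectory.Open(new DirectoryInfo(@"C:\lucene-index"));
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Index one document: "id" is stored verbatim, "body" is tokenized for search.
        using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("body", "some document text containing keywords",
                              Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
            writer.Commit();
        }

        // Query the "body" field for a keyword and print the ids of matching documents.
        using (var searcher = new IndexSearcher(dir, true))
        {
            var parser = new QueryParser(Version.LUCENE_30, "body", analyzer);
            var hits = searcher.Search(parser.Parse("keywords"), 10);
            foreach (var hit in hits.ScoreDocs)
                Console.WriteLine(searcher.Doc(hit.Doc).Get("id"));
        }
    }
}
```

Scaling this to millions of documents is mostly a matter of batching the AddDocument calls and committing periodically.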

Related

Hive vs SQL Server performance

1) I started using Hive about two months ago. I have the same task that I previously ran in SQL Server. I found that Hive is slow and takes much longer to execute queries, while SQL Server executes them in a few minutes or seconds.
After executing the task in Hive, when I cross-check the results in both (SQL Server and Hive), I find some differences (not in all tables, but in some).
e.g.: I have one table with 2012 records; when I executed the task on the same table in Hive, I got 2007 records.
Why is this happening?
2) If I want to speed up execution in Hive, what should I do?
(Currently I am running all of this on a single cluster. If I add nodes, how many would I need to improve performance?)
Please suggest some solutions or good practices so that I can do this properly.
Thanks.
Hive and SQL Server are not comparable in any way other than the similarity in the syntax of the query language.
While SQL Server is built to respond in real time from a single machine, Hive is for processing large data sets that may span hundreds or thousands of machines.
Hive (via Hadoop) has a lot of overhead for starting up a job.
Hive and Hadoop will not cache data in memory the way SQL Server does.
Hive has only recently added indexes, so most queries end up being a full table scan.
If your dataset fits on a single computer, you probably want to stick with SQL Server and not Hive. Hive performance tuning is mostly Hadoop performance tuning, although depending on the types of queries you run there can be free performance gains from using the LazyBinarySerDe.
Hive does have some differences from regular SQL that may be affecting your query. Without more details I can't speculate as to why.
Ignore the "they aren't comparable in any way" comment. If it stores data, it is comparable to any other method of storing data.
But be aware that SQL Server, 13 years ago, already had 1000+ people being paid full-time to improve the product. So while that doesn't "prove" anything, it does increase one's confidence that more work = more results.
More importantly, look for any non-trivial benchmark of an open-source and/or non-relational method of storing data versus one of the mainstream relational databases. You won't find one. That says a lot to me. (Also, mainstream isn't necessary, since the current world's fastest data engine isn't even mainstream. But if that level is needed, look at Exasol.)
If your need is to learn to work with the technology at your job and that technology is Hive, my recommendation is to find someone who is really focused on getting the most out of Hive query performance. If there is a Hive query guru out there, find them. But if you need a lot more than what they can give you, you're using the wrong technology.
And if Hive isn't a requirement, I would avoid it and other technologies lacking the compelling business model that will guarantee their survival past 5 years and move them out of the niche category they currently occupy (currently about 20 times less popular than any mainstream data engine - https://db-engines.com/en/ranking).

Third Party Tools for Monitoring SQL Server Performance

I'm in a situation where I came into a new job and I have to support several legacy systems. The original developer is no longer around. These legacy systems are really hammering away at our SQL Server and killing performance. I know that there are a lot of things that can be done in the code, but rewriting code is really my last resort.
What I'm looking for is some sort of tool that will monitor the queries coming into the server and give recommendations on indexing solutions. I know I can use SQL Server Profiler, but I'm looking for something a little more user-friendly that can help me make the indexing decisions.
I know I didn't explain it very well, but I'm sure this is a common request. I'd like to make informed decisions on what to index and avoid "shooting from the hip" and indexing everything in sight. Thanks for any recommendations!
You don't need a third party tool for this.
Assuming SQL Server 2005+, as long as you can use SQL Profiler (actually SQL Trace - don't use the Profiler GUI for this, to keep tracing overhead as low as possible) to collect a representative workload, you can use the Database Engine Tuning Advisor to automate analysis of the workload and make indexing recommendations.
You can also use the missing-index DMVs for a quick overview of areas to investigate, but the DTA will do a more holistic analysis and take into account the possible adverse effects of indexes on data modification statements.
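As a rough sketch of that quick-overview approach (the DMV names are real; the connection string and the 'improvement' ranking expression are assumptions - the latter is a common heuristic, not an official formula):

```csharp
using System;
using System.Data.SqlClient;

class MissingIndexes
{
    static void Main()
    {
        // Join the three missing-index DMVs and rank suggestions by a rough
        // estimate of how much each proposed index would help.
        const string sql = @"
            SELECT TOP 10
                   d.[statement] AS table_name,
                   d.equality_columns,
                   d.inequality_columns,
                   d.included_columns,
                   s.user_seeks * s.avg_total_user_cost * s.avg_user_impact AS improvement
            FROM sys.dm_db_missing_index_details AS d
            JOIN sys.dm_db_missing_index_groups AS g
              ON g.index_handle = d.index_handle
            JOIN sys.dm_db_missing_index_group_stats AS s
              ON s.group_handle = g.index_group_handle
            ORDER BY improvement DESC;";

        using (var conn = new SqlConnection("Server=.;Database=master;Integrated Security=true"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine("{0}: eq={1} ineq={2} incl={3}",
                        reader["table_name"], reader["equality_columns"],
                        reader["inequality_columns"], reader["included_columns"]);
        }
    }
}
```

Treat the output as leads to investigate, not as indexes to create blindly; that is exactly where the DTA's more holistic analysis earns its keep.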
+1 for Martin's answer, but since you asked about 3rd party tools, I'll mention one of my favorites (and no, I don't work for the company). Ignite for SQL Server does an excellent job of analyzing server activity in terms of wait time analysis. It won't make recommendations for you, but it will quickly identify the worst performing queries where you need to focus your effort.
SQL Server 2005+ has a lot of DMVs (dynamic management views) that you can query to get server info, as well as the Profiler / SQL Trace tool.
We administer several large database servers.
Idera is a good tool to manage multiple database servers easily.
I think you'd be a much better DBA if you learned more about the built-in functionality of SQL Server.
Have a browse of http://msdn.microsoft.com/en-us/library/ms188754.aspx to find out more about DMVs and functions.
Another common issue with performance could be your indexes.
There's a great tutorial that combines the DMVs with improving indexes here:
http://searchsqlserver.techtarget.com/tip/Using-dynamic-management-views-to-improve-SQL-Server-index-effectiveness
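In the same spirit, here is a small sketch (the DMVs are real; the connection string is an assumption) that flags indexes written to far more often than they are read - candidates for review:

```csharp
using System;
using System.Data.SqlClient;

class IndexUsage
{
    static void Main()
    {
        // Compare read activity (seeks/scans/lookups) to write activity per index.
        const string sql = @"
            SELECT OBJECT_NAME(s.object_id) AS table_name,
                   i.name AS index_name,
                   s.user_seeks + s.user_scans + s.user_lookups AS reads,
                   s.user_updates AS writes
            FROM sys.dm_db_index_usage_stats AS s
            JOIN sys.indexes AS i
              ON i.object_id = s.object_id AND i.index_id = s.index_id
            WHERE s.database_id = DB_ID()
            ORDER BY writes DESC;";

        using (var conn = new SqlConnection("Server=.;Database=MyApp;Integrated Security=true"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine("{0}.{1}: reads={2}, writes={3}",
                        reader["table_name"], reader["index_name"],
                        reader["reads"], reader["writes"]);
        }
    }
}
```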
Idera is really worth checking out, though, as a good starting point. Combined with the DMVs & SQL Trace, there shouldn't be much you won't be able to fix.
Idera just takes most of the legwork out of doing things.
http://www.idera.com/Content/Home.aspx
Idera: SQL Diagnostic Manager

What database to use for big data storage and manipulation?

I have to decide which database server to use for my next project, but the simple decision to use MySQL, as in almost all the projects I have done, is harder now, because I expect a very large number of records.
The database will store a user list, some other irrelevant tables, and lastly some user-collected data. Say I have 6000 users responding to a quiz about each other. Simple math shows that if each one completes the quiz about everyone else (and in my project it is 99% certain that will happen), I'll end up with 35.99 million records (users exclude themselves, so the calculation is 6000 * 5999). Unfortunately 6000 may turn out to be a small number; the real one is growing day by day.
What to choose? MySQL, and maybe, if things go well and the project grows, expanding it into a cluster? PostgreSQL? MSSQL? Oracle?
I've read about all of them; each one has its pros and cons, but I still don't know what to choose. The advantage of MySQL and PostgreSQL is, of course, the starting price of $0, which is pretty nice for a typical self-funded startup.
Any opinions, pieces of advice? If you encountered this situation in your experience as developers, I'd love to hear from you.
These days, free isn't something that differentiates between databases any more. Both Oracle and SQL Server have free versions, but the limitations are on resources: a 4 GB database, RAM, and single-CPU utilization. Millions of records are not a concern; what matters more is which datatypes you're using.
I saw the OP's comment about not liking MS software - that's your prerogative, but the free versions of both Oracle and SQL Server benefit from a seamless transition to the upscale versions of the respective database.
Personally, my choice would be either Oracle or SQL Server because of, IMHO, real feature considerations like hierarchical query support, subquery factoring/CTEs, packages (long before I get concerned with functions/procedures), full-text searching, XML support, etc.
MySQL will handle 35 million records no problem. Worry about scalability when you get there. You can easily add RAID disks backing your database tables, and if you really start getting big you can get a Compellent SAN that will scream... Don't worry about the DB engine as much as the underlying hardware. MySQL rocks for us with millions of records.
I've had no problems handling tables as large as 36,000,000 rows on MySQL and Oracle.
Just be sure that you index the proper columns, run EXPLAINs for your queries, and maintain proper design principles.
Most of the truly large scale web properties use a distributed key-value store. That said, 35 million is large, but not that large. With most modern databases, your main two scaling worries should be throughput and what happens when no single box can contain your entire database anymore. And both of these problems can be solved to some degree for any database you choose to use. (Caching, replication, sharding, etc.)
Use MySQL until you can't anymore. At that point, you ought to be rolling in dough anyways and you now have a very desirable problem.
Use MySQL as it's free and you have experience with it.
Besides, in my opinion, how you design the tables matters more than which database you use.
35 million records can be easily handled by MS SQL Server (assuming proper database design, indices, etc.). You can start with the free SQL Server Express edition and later, if you need, you can upgrade to the full version which supports clustering, etc.
SQL Server Express does have some limitations - single CPU, 1 GB memory, max 4 GB database size and a few other things. I'm not sure how quickly these limitations will become a problem but you can always move to the full version when you run into them.
MySQL(i) & PostgreSQL
- $0 cost
- large community
- many tutorials
- well documented
MSSQL
- you can get "money" from MS if you promote that you are using MSSQL (secret information from some companies I worked for)
- MS tools work very well
- a complete tool set, from the C# IDE through the .NET libraries to Windows Server 2003
Oracle
- professional and commercial provider
- used by many large companies (I have also heard that Blizzard (World of Warcraft) uses Oracle)
- expensive
The final decision depends on the specific requirements of your project.
Make yourself a quick list of the things that ARE IMPORTANT for your project (e.g. fast queries) and look up which database's pros best match your requirements.
Everything is about design. SQL databases are a bit like cars: you just have to know which component goes where.
Make a clear design and you won't struggle with any of them.
Maybe you can test Firebird.
There is a blog post about a big Firebird database here.
The MySQL licence is here (it is not always free).
PostgreSQL and Firebird are free.
First of all, don't think about performance. Premature optimization being the root of all evil and all that. You can always throw more hardware and/or tuning at it later.
All of those mentioned should perform nicely if tuned and maintained correctly. I'd focus on manageability and familiarity. IMHO, open-source databases excel at manageability (perhaps not the best GUIs, but the CLI has been my home for a long, long time).
And if the database becomes the bottleneck, why limit yourself to those choices? How about a key-value distributed database? Or perhaps serialize data directly to disk? Storing data outside of a RDBMS, while often frowned upon, might be the correct path. Or simply use the common route of denormalization.
Always remember not to optimize prematurely.
As far as opinions go (since you specifically asked for it) I favor open source databases, specifically PostgreSQL. It's rock solid, fast and very well-featured. And even with (relatively) large datasets it has performed superbly on mediocre hardware (some tuning involved, of course, but you can't skip that step no matter which db you end up choosing).

SQL Server 2008 Full Text Search (FTS) versus Lucene.NET

I know there have been questions in the past about SQL Server 2005 versus Lucene.NET, but since 2008 came out with a lot of changes to full-text search, I was wondering if anyone can give me pros/cons (or a link to an article).
SQL Server FTS is going to be easier to manage for a small deployment. Since FTS is integrated with the DB, the RDBMS handles updating the index automatically. The con here is that you don't have an obvious scaling solution short of replicating DB's. So if you don't need to scale, SQL Server FTS is probably "safer". Politically, most shops are going to be more comfortable with a pure SQL Server solution.
On the Lucene side, I would favor SOLR over straight-up Lucene. With either solution you have to do more work yourself updating the index when the data changes, as well as mapping data yourself to the SOLR/Lucene index. The pros are that you can easily scale by adding additional indexes. You could run these indexes on very lean linux servers, which eliminates some license costs. If you take the Lucene/SOLR route, I would aim to put ALL the data you need directly into the index, rather than putting pointers back to the DB in the index. You can include data in the index that is not searchable, so for example you could have pre-built HTML or XML stored in the index, and serve it up as a search result. With this approach your DB could be down but you are still able to serve up search results in a disconnected mode.
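For example, the pre-built HTML idea looks roughly like this in Lucene.Net (field names invented): the HTML is stored so it comes back with every hit, but it is never indexed, so it adds nothing to search cost:

```csharp
using Lucene.Net.Documents;

static class SearchDocs
{
    // Build a document that is searchable on "body" but also carries
    // pre-rendered HTML to serve directly as a search result.
    public static Document Build(string id, string plainText, string prebuiltHtml)
    {
        var doc = new Document();
        doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("body", plainText, Field.Store.NO, Field.Index.ANALYZED));
        // Stored but NOT indexed: retrievable from a search hit, never searched.
        doc.Add(new Field("html", prebuiltHtml, Field.Store.YES, Field.Index.NO));
        return doc;
    }
}
```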
I've never seen a head-to-head performance comparison between SQL Server 2008 and Lucene, but would love to see one.
I built a medium-sized knowledge base (maybe 2 GB of indexed text) on top of SQL Server 2005's FTS in 2006, and have now moved it to 2008's iFTS. Both setups have worked well for me, but the move from 2005 to 2008 was actually an improvement.
My situation was NOT like StackOverflow's, in the sense that I was indexing data that was only refreshed nightly; however, I was trying to join search results from multiple CONTAINSTABLE statements back to each other and to relational tables.
In 2005's FTS, this meant each CONTAINSTABLE would have to execute its search on the index and return the full results, and the DB engine would then join those results to the relational tables (this was all transparent to me, but it was happening and was expensive for the queries). 2008's iFTS improved this situation because the database integration allows the multiple CONTAINSTABLE results to become part of the query plan, which made a lot of searches more efficient.
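The pattern being described is roughly the following (a sketch only - the table and column names are invented, and the search term is passed in as a parameter):

```csharp
using System;
using System.Data.SqlClient;

class FtsJoinDemo
{
    static void Main()
    {
        // Two CONTAINSTABLE results joined back to the base table and to each
        // other via the full-text [KEY] column, combining their [RANK] scores.
        const string sql = @"
            SELECT a.ArticleId, a.Title, t.[RANK] + b.[RANK] AS CombinedRank
            FROM dbo.Articles AS a
            JOIN CONTAINSTABLE(dbo.Articles, Title, @term) AS t ON t.[KEY] = a.ArticleId
            JOIN CONTAINSTABLE(dbo.Articles, Body,  @term) AS b ON b.[KEY] = a.ArticleId
            ORDER BY CombinedRank DESC;";

        using (var conn = new SqlConnection("Server=.;Database=KB;Integrated Security=true"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@term", "\"full text\"");
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine("{0} ({1})", reader["Title"], reader["CombinedRank"]);
        }
    }
}
```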
I think both 2005's and 2008's FTS engines, as well as Lucene.NET, have architectural tradeoffs that will align better or worse with a given project's circumstances - I just got lucky that the upgrade worked in my favor. I can completely see why 2008's iFTS wouldn't work in the same configuration as 2005's for a highly OLTP use case like StackOverflow.com. However, I would not discount the possibility that the 2008 iFTS could be isolated from the heavy insert transaction load... but it also sounds like that could be as much work to accomplish as moving to Lucene.NET... and the cool factor of Lucene.NET is hard to ignore ;)
Anyway, for me, the ease and efficiency of SQL 2008's iFTS in the majority of situations probably edges out Lucene's 'cool' factor (though it is easy to use, I've never used it in a production system, so I'm reserving comment on that). I would be interested to know how much more efficient Lucene is (or has turned out to be - is it implemented now?) in StackOverflow or similar situations.
This might help:
https://blog.stackoverflow.com/2008/11/sql-2008-full-text-search-problems/
Haven't used SQL Server 2008 personally, though based on that blog entry, it looks like the full-text search functionality is slower than it was in 2005.
We use both full-text search options, but in my opinion it depends on the data itself and your needs.
We scale with web servers, and therefore I like Lucene, because I don't have that much load on the SQL Server.
If you are starting from zero and want full-text search, I would prefer the SQL Server solution, because it is really fast to get results; with Lucene you have to implement more at the start (and also acquire some know-how).
One consideration you need to keep in mind is what kind of search constraints you have in addition to the full-text constraint. If you are using constraints that Lucene can't provide, then you will almost certainly want to use FTS. One of the nice things about 2008 is that they improved the integration of FTS with standard SQL Server queries, so performance should be better with mixed relational and full-text constraints than it was in 2005.
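A mixed query of that kind is simply a full-text predicate sitting alongside ordinary relational ones. A sketch, with invented table and column names:

```csharp
using System;
using System.Data.SqlClient;

class MixedConstraints
{
    static void Main()
    {
        // A CONTAINS predicate combined with a plain relational filter in one
        // query - the case that 2008's integrated FTS plans more efficiently.
        const string sql = @"
            SELECT p.ProductId, p.Name
            FROM dbo.Products AS p
            WHERE CONTAINS(p.Description, @term)
              AND p.Price < @maxPrice;";

        using (var conn = new SqlConnection("Server=.;Database=Shop;Integrated Security=true"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@term", "waterproof");
            cmd.Parameters.AddWithValue("@maxPrice", 50);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader["Name"]);
        }
    }
}
```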

When is it time to change database backends?

Is there a general rule of thumb to follow when storing web application data that tells you which database backend should be used? Is it the number of hits per day, the number of rows of data, or some other metric that I should consider when choosing?
My initial idea is that the order for this would look something like the following (but not necessarily, which is why I'm asking the question).
Flat Files
BDB
SQLite
MySQL
PostgreSQL
SQL Server
Oracle
It's not quite that easy. The only general rule of thumb is that you should look for another solution when the current one can't keep up anymore. That could include using different software (not necessarily in any globally fixed order), hardware or architecture.
You will probably get a lot more benefit out of caching data using something like memcached than switching to another random storage backend.
If you think you are going to ever need one of the heavyweights (SqlServer, Oracle), you should start with one of those at the beginning. Data migrations are extremely difficult. In the long run it will cost you less to just start at the top and stay there.
I think you're being overly specific in your rankings. You can pretty much start with flat files and the like for very small data sets, go up to something like DBM for slightly bigger ones that don't require SQL-like syntax, and go to some kind of SQL database after that.
But who wants to do all that rewriting? If the application will benefit from access to joins, stored procedures, triggers, foreign key validation, and the like--just use a SQL database regardless of the dataset size.
Which one you choose should depend more on the client's existing installations and what DBA skills are available than on the amount of data you're holding.
In other words, the size of your database is far from the only consideration, and maybe not the most important one.
There is no blanket answer to this, but ALMOST always, using flat files is not a good idea. You have to parse through them (I suppose) and they do not scale well. Starting with a proper database, like Oracle or SQL Server (or MySQL or Postgres if you are looking for free options), is a good idea. For very little overhead, you will save yourself a lot of effort and headache later on. They also allow you to structure your data in a non-stupid fashion, leaving you free to think about WHAT you will do with the data rather than HOW you will get it in and out.
It really depends on your data and how you intend to use it. At one of my previous positions, we used Postgres because of its native geo-location and timezone extensions, which allowed us to manage our data using polygonal datatypes. We needed to do that, and we also wanted to use stored procedures, views and the like.
Now, another place I worked at used MySQL simply because the data was normalized, standard row-by-row data.
SQL Server's free editions long had small database size limits (2 GB for the SQL Server 2000-era MSDE, 4 GB for SQL Server 2005 Express), but despite that limitation it remains a very stable platform for small to medium applications from which old data is purged.
Now, from working with Oracle and SQL Server 05/08, all I can tell you is that if you want the cream of the crop for stability, scalability and flexibility, then these two are your best bet. For enterprise applications, I strongly recommend them (merely because that's what we use where I work now).
Other things to consider:
Language integration (ASP.NET session storage, role management, etc.)
Query types (SELECT, UPDATE, DELETE) (although this is more of a schema design issue than a DBMS issue)
Data storage requirements
Your application's utilization of the database is the most critical consideration: mainly, which queries are used most often (SELECT, INSERT or UPDATE)?
Say you use SQLite: it is geared toward smaller applications, but for a "web" application you might need a bigger one like MySQL or SQL Server.
The way you write scripts and your web application platform also matter. If you're developing on a Microsoft platform, then SQL Server is a better alternative.
Typically, I go with what is commonly accepted by whichever framework I am using. So, if I'm doing .NET => SQL Server, Python (via Django or Pylons) => MySQL or SQLite.
I almost never use flat files though.
There is more to choosing an RDBMS solution than just "back-end horsepower". The ability to have commitment control, for example, so you can roll back a failed transaction, is one reason.
Unless you are building an application with a mega-transaction rate, most database engines will be adequate, so it becomes a question of how much you want to pay for the software, whether it runs on the hardware and operating system environment you want, and what expertise you have in managing that software.
That progression sounds painful. If you're going to include MS products (especially the for-pay SQL Server) in there anywhere, you may as well use the whole stack, since you only have to pay for the last of these:
SQL Server Compact -> SQL Server Express -> SQL Server Enterprise (clustered).
If you target your app at SQL Server Compact initially, all your SQL code is guaranteed to scale up to the next version without modification. If you get bigger than SQL Server Enterprise, then congratulations. That's what they call a good problem to have.
Also: go back and check the SO podcasts. I believe they talked about this briefly.
This question depends on your situation really.
If you have control over the server you're deploying to and you can install whatever services you need, then the trade-off between installing a MySQL or MSSQL Express server and coding against an existing database framework, versus coding against a flat-file structure, is not even worth considering.
What about Firebird? Where would that fit into that list?
And let's not forget the requirements that the "customer" of your solution must also have in place. If you're writing a commercial application for small companies, then Oracle might not be a good choice... but if you're writing a customized solution for a large enterprise which must share data among multiple campuses and has a good-sized IT department, then the decision of Oracle vs SQL Server would come down to what the customer most likely already has deployed.
Data migration nowadays isn't that bad, since we have those great tools from Embarcadero, so I would instead let the customer's needs drive the decision.
If you have the option, SQL Server is a good choice from the word go, predominantly because you have access to solid procedures and functions and the database backup facilities are totally reliable. Wrapping up as much of your logic as you can inside the database itself (rather than in whatever language you are using) helps security and performance - indeed, there's a good argument to be made for always using procedures for insert/update logic, as these (called with parameters rather than concatenated SQL) protect you against injection attacks.
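For instance, a parameterized stored procedure call from .NET (the procedure and parameter names here are invented) keeps user input out of the SQL text entirely:

```csharp
using System.Data;
using System.Data.SqlClient;

class ProcDemo
{
    static void Main()
    {
        using (var conn = new SqlConnection("Server=.;Database=App;Integrated Security=true"))
        using (var cmd = new SqlCommand("dbo.InsertCustomer", conn))
        {
            // CommandType.StoredProcedure plus parameters: the value is never
            // concatenated into SQL text, so classic injection has no way in.
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@Name", "O'Brien; DROP TABLE Customers--");
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```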
If I have the choice, the only time I'd consider MySQL in preference is for a large, fairly simple database predominantly used for read access. This isn't to decry MySQL, which has improved markedly of late and which I happily use if I don't have the choice, but for more complex systems with update/insert activity MSSQL is generally the superior option.
I think your list is subjective but I will play your game.
Flat Files
BDB
SQLite
MySQL
PostgreSQL
SQL Server
Oracle
Teradata
