I am building a search engine and have finished the first phase, spidering (fetching HTML documents and parsing each one for further links). Now I need to index the content of the HTML documents. At first I decided to use a DBMS (such as SQL Server) for this purpose, but then I found a library called Lucene.NET.
What is the difference between Lucene.NET and SQL Server, and which one is better for indexing HTML documents? I have read a lot about Lucene.NET and was surprised that it gives better performance than SQL Server. Can anyone explain this to me?
SQL Server is a general-purpose RDBMS that is not optimized for very fast text indexing (yes, it has full-text indexes, but it does lots of other things at the same time).
Lucene.NET is not an RDBMS; its main function is fast text indexing.
So it's not that surprising that it is better at this than SQL Server.
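To make that concrete, here is a toy sketch (in Python, purely illustrative) of the inverted index that sits at the heart of Lucene: each term maps directly to the set of documents containing it, so a term lookup is a single dictionary access rather than a scan over rows.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "fast text indexing with Lucene",
    2: "SQL Server is a general purpose RDBMS",
    3: "full text search in SQL Server",
}
index = build_index(docs)
print(sorted(index["text"]))  # doc IDs containing the term "text"
print(sorted(index["sql"]))   # doc IDs containing the term "sql"
```

A real engine adds tokenization, stemming, positions, and relevance scoring on top, but the data structure is the reason term queries don't degrade with table scans.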
I need to synchronise a CouchDB database with a SQL Server database, and I need your help. I'm totally new to this and don't really know a proper way to implement it. Is it even possible without writing thousands of lines of code? If it is, what's the easiest way to do it?
Moving data one way or the other in a specific case should be straightforward: read and parse the changes feed, then convert the JSON documents to SQL statements that you execute on the SQL Server side.
The general case (bi-directional, continuous sync between an MVCC and a non-MVCC database) is a hard problem without keeping extra state somewhere.
CouchDB has first-order support for conflicted documents; SQL Server does not. If you need your synchronisation to be bi-directional and to stand up to concurrent modification of documents, you will have a problem: CouchDB will quite happily accept multiple versions of the same document, a concept which has no direct equivalent on the SQL Server side.
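A minimal sketch of the one-way direction described above, in Python. The table and column names (`Docs`, `Id`, `Body`) and the feed shape are assumptions for illustration; a real sync would also need conflict handling and an upsert rather than a plain UPDATE.

```python
import json

def changes_to_sql(changes_json):
    """Turn one batch of a CouchDB _changes feed (with include_docs=true)
    into parameterized SQL statements for a hypothetical mirror table."""
    statements = []
    for change in json.loads(changes_json)["results"]:
        if change.get("deleted"):
            statements.append(("DELETE FROM Docs WHERE Id = ?", (change["id"],)))
        else:
            # Store the whole document as JSON; a real mapping would
            # project fields into columns and fall back to INSERT.
            statements.append((
                "UPDATE Docs SET Body = ? WHERE Id = ?",
                (json.dumps(change["doc"]), change["id"]),
            ))
    return statements

feed = ('{"results": ['
        '{"id": "a1", "doc": {"_id": "a1", "title": "hello"}},'
        '{"id": "b2", "deleted": true}]}')
for sql, params in changes_to_sql(feed):
    print(sql, params)
```

Running this in a loop against `_changes?feed=longpoll&include_docs=true`, remembering the last sequence number you processed, gives you continuous one-way replication with very little code.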
For example: Microsoft SQL Server vs. CouchDB.
The main benefit for me with CouchDB is that you can access it from pretty much anywhere! What advantages does a document-based database have over a relational one? Where would a document-based database be a better choice than a relational one?
I wouldn't say "accessing it from anywhere" is an advantage of CouchDB over SQL Server. Both are fully accessible from a variety of clients.
The key differentiating factor is how data is persisted: as tables and columns (SQL Server) versus documents (CouchDB). In addition, CouchDB is designed to leverage multiple copies with replication and map/reduce in a highly forgiving fashion. SQL Server can achieve the same level of fault tolerance, but true map/reduce is nonexistent in it (its set-handling capabilities fundamentally mimic it, however; see the GROUPING SETS keyword).
You should read this post, which shows that map/reduce has its place, but that you need to pick the right tool for the job:
http://gigaom.com/2009/04/14/mapreduce-vs-sql-its-not-one-or-the-other/
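To show the shape of a CouchDB view versus a SQL GROUP BY, here is a toy map/reduce in Python (illustrative only): map emits (key, value) pairs per document, reduce folds the values for each key, which is roughly the document-database analogue of `GROUP BY customer` with `SUM(total)`.

```python
from collections import defaultdict

def map_reduce(docs, map_fn, reduce_fn):
    """Apply map_fn to each doc, group emitted pairs by key, reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}

orders = [
    {"customer": "alice", "total": 10},
    {"customer": "bob", "total": 5},
    {"customer": "alice", "total": 7},
]
totals = map_reduce(orders, lambda d: [(d["customer"], d["total"])], sum)
print(totals)  # {'alice': 17, 'bob': 5}
```

The difference in practice is that CouchDB precomputes and incrementally maintains the reduced view on disk, while SQL Server computes the aggregate at query time (unless you build an indexed view).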
ColdFusion 9's full-text search is now based on Apache Solr (built on Lucene); Verity is still available, but it has too many limitations. We also use SQL Server.
Which one's better? Which one's easier?
UPDATE: I'm going to use it for searching against the name and description fields of the Products table.
Thanks!
Here are my 2 cents, tested with ~3,000,000 images with captions (primary key + image caption text of 100 to 500 characters):
CF9's Solr implementation is fast at returning results, really easy to set up, and fairly fast at building the index.
SQL Server 2005 FTS wasn't good enough; I tried it some time ago and didn't put it into production. SQL Server 2008 FTS is much better, though, and we're currently using it in our application. But the basic setup had to be tuned to get high-quality results.
Based on my experience and that of colleagues working with huge data sets and search-heavy applications, here is my ranking:
Lucene
Tuned SQL Server 2008 FTS
Solr
SQL Server 2005
Of course, CF9's Solr is the winner here if you want a fast setup, since you need just three tags to finish the job and get excellent results.
The important question: What are you going to use it for?
You can't pick the right tool for the job when you don't know what the job is ;)
I have a SQL Server 2008 database with a large amount of varchar(max) data that is currently indexed with full-text search. Unfortunately, row-level compression in SQL Server 2008 does not support LOB data.
I am toying with the idea of using SQLCLR to compress the data and a custom IFilter to enable the compressed data to be indexed with full-text search.
I'm interested in getting some feedback on this idea. Could it work? Has it been done before? What are the possible pitfalls? Can you recommend a better solution?
A long time ago, I built a mini-SharePoint that compressed incoming files with a zip library and stored the bytes in a varbinary(max) column. Since the spec called for searching metadata rather than actual file contents, I didn't have to worry about full-text search.
You could achieve the same thing with SQLCLR now. The main pitfall would be the CPU load from decompressing the data for indexing and searching, but CPUs are fast these days.
Option two? Buy more storage.
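To get a feel for the trade-off before committing to SQLCLR, you can measure it outside the database. A sketch in Python using zlib (a SQLCLR version would use a .NET compression library such as System.IO.Compression instead); the decompress step is exactly the work an IFilter would repeat for every row during index population.

```python
import zlib

# Repetitive text compresses well, like most natural-language LOB data.
text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 50).encode("utf-8")

compressed = zlib.compress(text, level=9)  # paid once, on write
restored = zlib.decompress(compressed)     # paid on every read/index pass

assert restored == text  # compression is lossless
print(f"original: {len(text)} bytes, compressed: {len(compressed)} bytes")
```

Timing `zlib.decompress` over a representative sample of your varchar(max) values would tell you whether the CPU cost during full-text index population is acceptable for your data volume.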
I'm storing papers in SQL Server 2005 and am looking for a way to paste in the text of a paper and then search for potential plagiarism (copied content) in the database.
What's the best way to go about this? Is there a way to get a gauge for the extent to which something is similar to something else using full-text indexing, for several paragraphs of content?
Why don't you install Google Desktop and have it index only that one directory? Then you can have Google do the indexing for you.
This is not really the sort of problem that full-text indexing in SQL Server is designed to solve. There's nothing built in to SQL Server that you can really use to help with this.
There are a number of specialised plagiarism detection tools, which a Google search will turn up for you. That's probably your best bet.
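That said, if you want a rough similarity gauge yourself, one simple approach (which dedicated tools refine heavily) is to split each paper into overlapping word n-grams ("shingles") and compare the sets with Jaccard similarity. A Python sketch, illustrative only:

```python
def shingles(text, n=3):
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

paper = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"
score = jaccard(shingles(paper), shingles(suspect))
print(f"similarity: {score:.2f}")
```

You could run this pairwise between the submitted paper and stored ones, flagging anything above a threshold for human review; at scale, techniques like MinHash make the comparison tractable without checking every pair exhaustively.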