I have a large XML document stored in an XML column in SQL Server. I basically need to perform a free-text search across the elements in the document.
Would you use
A) SQL Free Text Search
B) A stored procedure that traverses the XML and checks each value of each element
C) Use Lucene.NET to build an Index on the fly and search the index?
Users understand this will be slow to some degree. If the stored procedure weren't a monster to write I'd lean toward that, because it's the least to maintain and decreases overall complexity.
The book "Pro SQL Server 2008 XML" has a section on Full-Text indexing of XML data that may be of interest to you. It mentions that when XML data is indexed, a special "XML Word Breaker" is used to separate text content from the markup. Essentially this means that only the content is indexed, not the markup. Full-text indexes also support stemming and thesaurus matching.
Just noticed that you are using SQL Server 2005, so you'll have to check whether this functionality is supported. I suspect that it is.
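For reference, setting up full-text indexing on an XML column looks roughly like this (the table, key index, and catalog names here are hypothetical; the table needs a single-column unique key for `KEY INDEX`):

```sql
-- Assumes a table dbo.Docs with an xml column [Data] and a unique key index PK_Docs.
CREATE FULLTEXT CATALOG DocsCatalog;

CREATE FULLTEXT INDEX ON dbo.Docs ([Data])
    KEY INDEX PK_Docs
    ON DocsCatalog;

-- The XML word breaker indexes element content only, not the markup:
SELECT DocId, [Data]
FROM dbo.Docs
WHERE CONTAINS([Data], 'invoice');
```

Note that `CONTAINS` matches words in the text content; it cannot be used to search for element or attribute names, precisely because the markup is stripped out at indexing time.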
We have a file upload system and would like to use the new semantic search feature in SQL Server 2012. Is that possible without using FileTables?
This is our schema:
I think there are two questions here.
Can you use Semantic Search without using FileTable?
Yes, you can. It can be used on any table with Full-Text indexing turned on.
Here is the list of prerequisites:
link.
Basically you can use it on the data, which is loaded into the database.
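As a sketch of what that looks like (hypothetical table and index names; this also assumes the semantic language statistics database has been installed, which is one of the prerequisites):

```sql
-- Enable full-text + semantic indexing on an ordinary table, no FileTable needed.
-- dbo.Uploads has a unique key index PK_Uploads.
CREATE FULLTEXT INDEX ON dbo.Uploads
    (Description LANGUAGE 1033 STATISTICAL_SEMANTICS)
    KEY INDEX PK_Uploads;

-- Top key phrases the semantic index extracted from the Description column:
SELECT TOP (10) KEYP.document_key, KEYP.keyphrase, KEYP.score
FROM SEMANTICKEYPHRASETABLE(dbo.Uploads, Description) AS KEYP
ORDER BY KEYP.score DESC;
```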
The second question is whether your schema benefits from Semantic Search, and to what extent.
Looking at your schema, I understand that your database holds only paths to the documents and their "descriptions". Therefore, you can enable Semantic Search on the columns in your database. It will allow you to use Semantic Search on FileName and Description, but not on the documents' contents.
In order to use Semantic Search on the contents of these documents, you'll need to store the documents themselves in the SQL database. The FileTable structure helps with this task, although you can choose another way of storing whole documents in your database.
I have a product table where the description column is fulltext indexed.
The problem is, users frequently search for a single word which happens to be in the noiseXXX.txt files.
We'd like to keep the noise word functionality enabled, but is there any way to turn it off just for this one column?
I think you can do this in 2008 with the SET STOPLIST=OFF, but I can't seem to find similar functionality in SQL Server 2005.
In SQL Server 2005, noise word lists are applied to the entire server. You can disable noise words for the entire server by deleting the appropriate noise word file and then re-building the full text indices. But I do not believe it is possible in SQL Server 2005 to selectively disable noise words for a single table. See for instance here, here and here.
In SQL Server 2008, FTS moves from noise word files to stoplists. Stoplists are containers for collections of stopwords that are excluded from full-text indexes; they replace the functionality of noise word files.
In SQL Server 2008 (compatibility level 100 only) you can create multiple stoplists for a given language, and stoplists can be specified for individual tables. That is, one table could use a given stoplist, a second table could use a different stoplist, and a third could use no stoplists at all. Stoplist settings apply to an entire table, so if you have multiple columns indexed in a single table, they all must use the same stoplist.
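To illustrate the 2008-era mechanics (table and stoplist names are hypothetical):

```sql
-- SQL Server 2008+ only, compatibility level 100.
-- Start from the system stoplist, or CREATE FULLTEXT STOPLIST ... ; for an empty one.
CREATE FULLTEXT STOPLIST ProductStoplist FROM SYSTEM STOPLIST;

-- Point one table's full-text index at the custom stoplist...
ALTER FULLTEXT INDEX ON dbo.Product SET STOPLIST ProductStoplist;

-- ...or disable stopwords for that table entirely:
ALTER FULLTEXT INDEX ON dbo.Product SET STOPLIST OFF;
```

Because the stoplist is set per full-text index (one per table), every indexed column in that table shares the same setting, as noted above.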
So to answer your question, I do not believe it is possible in SQL Server 2005 to selectively disable noise words for individual tables while leaving them on for other tables. If this is a deal-breaker for you, this might be a good opportunity to upgrade your server to SQL Server 2008 or 2012.
I was trying to understand the basic advantage of using the XML data type in SQL Server 2005. I went through the article here, which says that when you want to delete multiple records, you can serialize the XML, send it to the database, and delete the matching rows with a single query.
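For context, a sketch of that delete pattern (table and element names here are hypothetical, not taken from the article):

```sql
-- The client serializes the keys to delete as XML and passes it in one round trip.
DECLARE @ids xml = N'<ids><id>3</id><id>7</id><id>12</id></ids>';

-- nodes() shreds the XML into rows; value() extracts each id as an int.
DELETE FROM dbo.Orders
WHERE OrderId IN (
    SELECT x.id.value('.', 'int')
    FROM @ids.nodes('/ids/id') AS x(id)
);
```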
I was curious to look into any other advantage of using this DataType...
EDIT
Reasons for Storing XML Data in SQL Server 2005
Here are some reasons for using native XML features in SQL Server 2005 as opposed to managing your XML data in the file system:
You want to use administrative functionality of the database server for managing your XML data (for example, backup, recovery and replication).
My Understanding - Can you share some knowledge over it to make it clear?
You want to share, query, and modify your XML data in an efficient and transacted way. Fine-grained data access is important to your application. For example, you may want to insert a new section without replacing your whole document.
My Understanding - The XML lives in a specific row's cell, so in order to add a new section to it an UPDATE is required, which means the whole document gets rewritten. Right?
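Not quite: the xml type's modify() method lets you express the change at node level, so you describe the insertion rather than rebuilding the document yourself (table and element names below are hypothetical):

```sql
-- XML DML: insert one new element into the stored document in place.
UPDATE dbo.Docs
SET [Data].modify('
    insert <section title="New Section"/> as last into (/document)[1]
')
WHERE DocId = 1;
```

This is the "fine-grained data access" the list refersces: you send a small node-level operation instead of shipping the whole document back and forth.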
You want the server to guarantee well-formed data, and optionally validate your data according to XML schemas.
My Understanding - Can you share some knowledge over it to make it clear?
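What this means in practice: an untyped xml column already rejects anything that is not well-formed XML, and if you bind the column to an XML schema collection, the server also validates every insert and update against the schema. A minimal sketch (hypothetical schema and table):

```sql
-- Register a schema, then type the column with it.
CREATE XML SCHEMA COLLECTION DocSchema AS N'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="document">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>';

CREATE TABLE dbo.TypedDocs (
    DocId int PRIMARY KEY,
    [Data] xml(DocSchema)  -- inserts that violate DocSchema fail with an error
);
```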
You want indexing of XML data for efficient query processing and good scalability, and the use of a first-rate query optimizer.
My Understanding - Same can be done by adding individual columns. Then why XML column?
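The difference from individual columns is that XML indexes work when the document structure is variable or deeply nested, where shredding into fixed columns isn't practical. A sketch of what gets created (hypothetical names):

```sql
-- The primary XML index shreds the document into an internal node table
-- so XQuery predicates don't re-parse the XML at query time.
CREATE PRIMARY XML INDEX IX_Docs_Data ON dbo.Docs ([Data]);

-- Secondary indexes (PATH / VALUE / PROPERTY) speed up specific query shapes.
CREATE XML INDEX IX_Docs_Data_Path ON dbo.Docs ([Data])
    USING XML INDEX IX_Docs_Data FOR PATH;
```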
Pros:
Allows storage of xml data that can be automatically controlled by an xml schema - thereby guaranteeing a certain level of data quality
Many web/desktop apps store data in xml form, these can then be easily stored and queried in the database - so it is a great place to store xml data that an app may need to use (e.g. for configuration settings)
Cons:
Be careful about using xml fields, they may start off as innocent storage but can become a performance nightmare if you want to search, analyse and report on many records.
Also, if xml fields will be added to, changed or deleted this can be slow and leads to complex t-sql.
In replication, the whole xml gets updated even if only one node changes - therefore you could have many more conflicts that cannot easily be resolved.
I would say of the 4 advantages you've listed, these two are critical:
You want to share, query, and modify your XML data in an efficient and transacted way
SQL Server stores the XML in an optimised way that it wouldn't for plain strings, and lets you query the XML efficiently on the server, rather than requiring you to bring the entire XML document back to the client. Think how inefficient it is if you want to query 10,000 rows whose XML column each holds 1 KB of data: for a small XPath query you would otherwise need to return 10 MB across the wire, for each client, each time.
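A sketch of that server-side querying (hypothetical table and document shape):

```sql
-- Only the matching fragments cross the wire, not the full documents.
SELECT DocId,
       [Data].value('(/document/title)[1]', 'nvarchar(200)') AS Title
FROM dbo.Docs
WHERE [Data].exist('/document[@status="published"]') = 1;
```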
You want indexing of XML data for efficient query processing and good scalability, and the use of a first-rate query optimizer
This ties into what I said above, it's far more efficiently stored than a plain text column which would also run into page fragmentation issues.
So let's say I have two databases, one for production purposes and another one for development purposes.
When we copied the development database, the full-text catalog did not get copied properly, so we decided to create the catalog ourselves. We matched all the tables and indexes, recreated the catalog, and the search feature seems to be working okay too (though it hasn't been fully tested yet).
However, the former catalog had a lot more files in its folder than the one we created manually. Is that fine? I thought they would have exactly the same number of files (though the sizes might vary).
First...when using full text search I would suggest that you don't manually try to create what the wizard does for you. I have to wonder about missing more than just some data. Why not just recreate the indexes?
Second...I suggest that you don't use the freetext feature of SQL Server unless you have no other choice. I used to be a big believer in freetext, but was shown an example comparing creating and searching a Lucene(.net) index against creating and searching an index in SQL Server. Creating a SQL Server full-text index is considerably slower and harder to maintain than creating a Lucene index, and searching a SQL Server index is considerably less accurate (poor results) in comparison to Lucene. Lucene is like having your own personal Google for searching data.
How? Index your data (only the data you need to search) in Lucene and include the Primary Key of the data that you are indexing for use later. Then search the index using your language and the Lucene(.net) API (many articles written on this topic). In your search results make sure you return the PK. Once you have identified the records you are interested in you can then go get the rest of the data and/or any related data based on the PK that was returned.
Gotchas? Updating the index is also much quicker and easier. However, you have to roll your own for creating the index, updating the index, and searching the index. SUPER EASY to do...but still...there are no wizards or one handed coding here! Also, the index is on the file system. If the file is open and being searched and you try to open it again for another search you will obviously have some issues...so writing some form of infrastructure around opening and reading these indexes needs to be built.
How does this help in SQL Server? You can easily wrap your Lucene search in a CLR function or proc which can be installed in the database that you can then use as though it were native to your t-SQL queries.
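From the T-SQL side, calling such a wrapper might look like this (the function name, signature, and column names are entirely hypothetical; you would define them when you build the CLR assembly):

```sql
-- Hypothetical CLR table-valued function dbo.SearchLucene(@query)
-- returning the PKs Lucene matched; join back for the full rows.
SELECT p.*
FROM dbo.SearchLucene(N'widget AND blue') AS hit
JOIN dbo.Product AS p ON p.ProductId = hit.PK;
```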
I'm creating a news portal site that stores many news items. Every item has HTML data. I'm using SQL Server 2005. I have two choices:
Save the news data to an ntext field.
Save the news data to an HTML file and save the file name to an nvarchar field.
Which way gives good performance and quick search operations? If I choose the second way, then when I search the news I have to open every file and search each one in turn.
Which is best?
Is there another way?
EDIT
My news count may grow to over 100,000. The current count is 1,000, and the SQL Server database size is 60 MB.
Use nvarchar(max), not ntext, for storage. Use full-text search for searching. Use FILESTREAM storage if the content consists of documents that have to be accessed through the Win32 API.
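With that layout, searching stays entirely server-side instead of iterating over files (table and column names here are hypothetical):

```sql
-- Body is nvarchar(max) with a full-text index; CONTAINS searches it directly.
SELECT NewsId, Title
FROM dbo.News
WHERE CONTAINS(Body, N'election AND results');
```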
Querying varbinary(max) and xml Columns (Full-Text Search)
Best Practices for Integrated Full Text Search
SQL Server 2005 Full-Text Queries on Large Catalogs: Lessons Learned
Using FILESTREAM with Other SQL Server Features