Is it possible to use SQL Server XML columns as a substitute for a real document DB (such as Couch or Mongo)?
If I were to create a table with a GUID PK Id and an XML column for the document, what would be the main problems compared to using a document DB?
SQL Server supports indexing over XML columns, so querying shouldn't be completely horrible, right?
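Concretely, I'm picturing something like this (just a sketch; names are placeholders):

CREATE TABLE dbo.Documents
(
    Id  UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
    Doc XML NOT NULL
);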
You've got several questions in here:
Is it possible to use SQL Server XML columns as a substitute for a real document DB (such as Couch or Mongo)? Yes, you can use it as a substitute, but no, you probably wouldn't be satisfied with the performance if you're exclusively storing XML and not leveraging any of SQL Server's relational tools.
If I were to create a table with a GUID PK Id and an XML column for the document, what would be the main problems compared to using a document DB? In a nutshell, scaling out. SQL Server doesn't scale this kind of thing out well. You can do it with replication, but it's painful to manage relative to a "real" document DB.
SQL Server supports indexing over XML columns, so querying shouldn't be completely horrible? The problem is that SQL Server's XML indexes can take several times the storage space of the original data, and they can't be maintained online (as in defrags), so you end up with locking issues during maintenance windows.
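For reference, the indexes in question would look something like this against the table sketched in the question (the primary XML index alone can take several times the size of the XML it covers):

-- primary XML index: required first, and the big storage consumer
CREATE PRIMARY XML INDEX PXML_Documents_Doc
    ON dbo.Documents (Doc);

-- optional secondary index to speed up path-based queries
CREATE XML INDEX SXML_Documents_Doc_Path
    ON dbo.Documents (Doc)
    USING XML INDEX PXML_Documents_Doc FOR PATH;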
I'm doing some experimenting with this on:
http://rogeralsing.com/2011/03/02/linq-to-sqlxml-projections/
Query speed is 'decent', but it's nothing I'd use for scaling.
Still, the joy of schema-free storage running on standard infrastructure is quite nice.
Yes, you can. Storing a document inside a SQL Server XML column will work, and if you use standard XML serialization it will leave you with a decent ACID-compliant key/value store. It will also allow you to query the documents with relative ease, and you can join the results to data that you store in a more relational way. We do so, and it works. If you store content in XML fields, storage demands are a lot lower than with NTEXT, and querying is more flexible and faster.
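For example, a query along these lines shreds nodes out of the XML and joins them back to relational data (a sketch only; the tables and document shape are made up):

-- pull order lines out of each XML document and join to a relational Products table
SELECT o.Id, p.Name, l.n.value('@qty', 'INT') AS Qty
FROM dbo.Orders AS o
CROSS APPLY o.Doc.nodes('/order/line') AS l(n)
JOIN dbo.Products AS p
    ON p.ProductId = l.n.value('@productId', 'INT');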
What SQL Server will not get you (compared to Mongo) is the seamless failover of replica sets and Mongo's autosharding. Also, atomic operations like incrementing a specific property deep inside a document are hard (though not impossible with the XQuery update function). Updates tend to be faster on most NoSQL databases, because they are more relaxed about the "data is only safe on disk" principle.
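For what it's worth, that XQuery-based update looks roughly like this (a sketch; the document shape and names are made up, and the UPDATE still takes normal row locks):

UPDATE dbo.Documents
SET Doc.modify('
    replace value of (/order/header/retryCount/text())[1]
    with (/order/header/retryCount)[1] + 1')
WHERE Id = @docId; -- @docId supplied by the caller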
Yes, it is possible. As to whether it's a good idea, this is just my 2 cents...
Before the XML datatype came along I worked on a system storing XML in an NTEXT column - that wasn't pleasant, and to get any real use out of the data meant shredding some of that data out into relational form.
OK, the XML datatype now makes it easier to query an XML blob and to extract certain values/index them. But personally, in general, I wouldn't. I'm not saying never use XML, as there are scenarios for it - rather, if that's all you're planning on doing, then I'd be thinking "is this the right tool for the job?". Using an RDBMS as a document database makes me feel a bit uneasy, whereas something like MongoDB has been built from the ground up as a document database.
In all honesty, I haven't done any performance testing on storing data as XML so I can't give you an indication of what performance would be like. Would be interested to know how this performs at scale.
Related
It's an old question I know, but with SQL Server 2012 is it finally ok to store files in the database, or should they really be kept in the filesystem with only references to them in the database?
If storing them in the database is considered acceptable these days, what is the most effective way to do it?
I'm planning to apply encryption, so I appreciate processing will not be lightning fast.
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo, too, as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
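Adding that filegroup to an existing database would look something like this (database name, logical file name, and path are placeholders):

ALTER DATABASE YourDatabase ADD FILEGROUP LARGE_DATA;

ALTER DATABASE YourDatabase
ADD FILE
(
    NAME = 'YourDatabase_LargeData',
    FILENAME = 'D:\SqlData\YourDatabase_LargeData.ndf',
    SIZE = 512MB
)
TO FILEGROUP LARGE_DATA;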
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(
    -- define the fields here; these are placeholders:
    ID INT IDENTITY(1,1) PRIMARY KEY,
    Photo VARBINARY(MAX) NULL
)
ON [Data]                 -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON [LARGE_DATA] -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!
There's still no simple answer. It depends on your scenario. MSDN has documentation to help you decide.
There are other options covered here. Instead of storing in the file system directly or in a BLOB, you can use the FILESTREAM or FileTable features in SQL Server 2012. The advantages of FileTable seem like a no-brainer (but admittedly I have no personal first-hand experience with them).
The article is definitely worth a read.
You might read up on FILESTREAM. Here is some info from the docs that should help you decide:
If the following conditions are true, you should consider using FILESTREAM:
Objects that are being stored are, on average, larger than 1 MB.
Fast read access is important.
You are developing applications that use a middle tier for application logic.
For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.
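If you decide to go that route, a FILESTREAM table looks roughly like this (a sketch; it assumes FILESTREAM is already enabled on the instance and the database has a FILESTREAM filegroup):

CREATE TABLE dbo.StoredFiles
(
    -- FILESTREAM requires a ROWGUIDCOL with a unique constraint
    FileId  UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    Name    NVARCHAR(260) NOT NULL,
    Content VARBINARY(MAX) FILESTREAM NULL
);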
I was trying to understand the basic advantage of using the XML data type in SQL Server 2005. I went through an article which suggested that if you want to delete multiple records, you can serialize their IDs as XML, send that to the database, and delete all the matching records with a single query.
I'm curious about any other advantages of using this data type...
EDIT
Reasons for Storing XML Data in SQL Server 2005
Here are some reasons for using native XML features in SQL Server 2005 as opposed to managing your XML data in the file system:
You want to use administrative functionality of the database server for managing your XML data (for example, backup, recovery and replication).
My understanding - can you share some knowledge to make this point clearer?
You want to share, query, and modify your XML data in an efficient and transacted way. Fine-grained data access is important to your application. For example, you may want to insert a new section without replacing your whole document.
My understanding - the XML sits in a specific row's cell, so adding a new section to it requires an UPDATE, which means the whole document gets rewritten. Right?
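For reference, the fine-grained modification the documentation is describing looks like this (a sketch with made-up names) - you splice the new node in server-side instead of round-tripping the whole document through the application:

UPDATE dbo.Articles
SET Body.modify('
    insert <section>new content</section>
    as last into (/article)[1]')
WHERE ArticleId = @id;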
You want the server to guarantee well-formed data, and optionally validate your data according to XML schemas.
My understanding - can you share some knowledge to make this point clearer?
You want indexing of XML data for efficient query processing and good scalability, and the use of a first-rate query optimizer.
My understanding - the same can be done by adding individual columns, so why an XML column?
Pros:
Allows storage of XML data that can be automatically validated against an XML schema - thereby guaranteeing a certain level of data quality (see the sketch below the Cons list)
Many web/desktop apps store data in XML form; such data can then be easily stored and queried in the database - so it is a great place to store XML data that an app may need to use (e.g. for configuration settings)
Cons:
Be careful about using XML fields: they may start off as innocent storage but can become a performance nightmare if you want to search, analyse, and report on many records.
Also, if XML fields will be added to, changed, or deleted, this can be slow and leads to complex T-SQL.
In replication, the whole XML gets updated even if only one node changes - therefore you could have many more conflicts that cannot easily be resolved.
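To illustrate the schema point from the Pros list above: a typed XML column is bound to a schema collection, and the server then rejects non-conforming documents. A minimal sketch, with a made-up schema:

CREATE XML SCHEMA COLLECTION dbo.ConfigSchema AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="settings">
     <xs:complexType>
       <xs:sequence>
         <xs:element name="theme"    type="xs:string"/>
         <xs:element name="pageSize" type="xs:int"/>
       </xs:sequence>
     </xs:complexType>
   </xs:element>
 </xs:schema>';

CREATE TABLE dbo.AppConfig
(
    Id       INT PRIMARY KEY,
    Settings XML(dbo.ConfigSchema) -- inserts/updates that violate the schema fail
);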
I would say of the 4 advantages you've listed, these two are critical:
You want to share, query, and modify your XML data in an efficient and transacted way
SQL Server stores the XML in an optimised way that it wouldn't for plain strings, and lets you query the XML efficiently on the server rather than requiring you to bring the entire XML document back to the client. Think how inefficient it is if you want to query 10,000 rows whose XML columns each contain 1 KB of data: for a small XPath query you would otherwise need to pull roughly 10 MB across the wire, for each client, each time.
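A sketch of the kind of server-side projection meant here (table and element names are hypothetical) - only the extracted titles cross the wire, not the full documents:

SELECT Id,
       Doc.value('(/book/title)[1]', 'NVARCHAR(200)') AS Title
FROM dbo.Books;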
You want indexing of XML data for efficient query processing and good scalability, and the use a first-rate query optimizer
This ties into what I said above: the XML is stored far more efficiently than it would be in a plain text column, which would also run into page-fragmentation issues.
We have a repository of tables. Around 200 tables, each table can be thousands of rows, all tables are originally in Excel sheets.
Each table has a different scheme. All data is text or numbers.
We would like to create an application that allows free text search on all tables (we define which columns will be searched in each table) efficiently - speed is important.
The main dilemma is which DB technology we should choose.
We created a mock up by importing all tables to MS SQL Server, and creating a full text index over them. The search is done using the CONTAINS keyword. This solution works well for a small number of tables, but it doesn't scale.
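Roughly, the mock-up does this per table (names are placeholders):

CREATE FULLTEXT CATALOG SearchCatalog;

CREATE FULLTEXT INDEX ON dbo.ImportedTable (SearchText)
    KEY INDEX PK_ImportedTable -- the table's unique key index
    ON SearchCatalog;

SELECT *
FROM dbo.ImportedTable
WHERE CONTAINS(SearchText, '"some term"');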
We thought about a NoSQL solution, but we don't yet have any experience in it.
Our limitations (which unfortunately I cannot affect): Windows servers only. But we can install whatever we want on them.
Thank you.
Check out ElasticSearch! It's a search server based on Apache Lucene and has a clean REST- and JSON-based API. Although it's usually used as a search index alongside a primary database, it can also be used stand-alone. So you may want to write an export routine for a few of your tables and try it out.
http://www.elasticsearch.org/
http://en.wikipedia.org/wiki/ElasticSearch
Comparison of ElasticSearch and Apache Solr (another Lucene-based search server):
https://docs.google.com/present/view?id=dc6zhtt5_1frfxwfff&pli=1
I have an object I am using to store document metadata in a table. The body text of the document can be very large, sometimes > 2 GB, so I will be storing it in an nvarchar(max) field in SQL 2008. I'll use SQL 2008 later to index that field. I won't be using filestreams because they are very restrictive to the database and prevent certain types of concurrency locking schemes.
This object is exposed to the developer via LINQ to SQL. My concern is that the field will be too large; I have seen .NET bomb out with an OutOfMemoryException if the text is > 1.5 GB.
So I am wondering, can I treat this blob as a stream with Linq? Or do I have to bypass Linq altogether if I want to use a blob?
Given the answer to "Can a LINQ query retrieve BLOBs [...]" I suspect you're out of luck. The System.Data.Linq.Binary type doesn't have any mechanism for streaming - basically it's just an immutable byte array representation.
There may be some deep LINQ mojo you could invoke, but I suspect it would really have to be pretty deep.
It's possible that the Entity Framework would handle it - I haven't investigated that.
I ended up writing my own method around LINQ to SQL, utilising the .WRITE method available on varchar(max) columns in SQL. This allows developers to chunk inserts into the DB for large data types.
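The core of it is the .WRITE clause, applied once per chunk - roughly like this (a sketch; names are made up):

-- passing NULL for the offset appends @chunk to the end of the existing value,
-- so each round trip stays small
UPDATE dbo.Documents
SET BodyText.WRITE(@chunk, NULL, NULL)
WHERE Id = @docId;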