In a SAN environment, we would have multiple storage devices (say each of them with 1TB), so cumulatively the SAN would provide several terabytes of storage capacity.
Which software is responsible for slicing up this storage capacity and allocating it to each VM (say, 500GB per VM)? Where does it reside?
I am finding it hard to picture this concept.
Depending on the technology involved, there are multiple ways to do this. For example, in block-storage environments, LUNs from different storage systems can be concatenated/striped/mirrored/RAIDed by volume-manager software on the target server. The same effect can be achieved by hardware virtualisation on the storage systems themselves: for example, one of the storage devices can act as a "roof" over all the rest (also, look at the thin-provisioning topic). In the NAS world, it's possible to build big trees of filesystems using different mount points for different storage systems.
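To make the block-storage case concrete, here is a toy sketch of how a volume manager might map a logical block address onto several backing LUNs, either by concatenation or by striping. It is illustrative only; real volume managers like LVM or mdadm do this in the kernel with far more metadata, and the sizes here are made up.

```python
# Toy model of a volume manager mapping one logical volume onto several LUNs.
# Illustrative only; sizes and block counts are made up.

def concat_map(logical_block, lun_sizes):
    """Concatenation: fill LUN 0 first, then LUN 1, and so on."""
    for lun_index, size in enumerate(lun_sizes):
        if logical_block < size:
            return lun_index, logical_block
        logical_block -= size
    raise ValueError("logical block beyond the end of the volume")

def stripe_map(logical_block, lun_count, stripe_blocks=64):
    """Striping: successive chunks of `stripe_blocks` rotate across the LUNs."""
    chunk, offset = divmod(logical_block, stripe_blocks)
    lun_index = chunk % lun_count
    block_on_lun = (chunk // lun_count) * stripe_blocks + offset
    return lun_index, block_on_lun

# Three 1TB LUNs in 4KiB blocks, presented to the VM host as one big volume.
luns = [268_435_456] * 3
print(concat_map(300_000_000, luns))         # -> (1, 31564544): lands on the second LUN
print(stripe_map(300_000_000, len(luns)))    # -> (0, 100000000): lands on the first LUN
```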
I would like to ask you: what are the pros/cons of an off-the-shelf distributed file system (like HadoopFS) over just mounting a drive over the network on Linux? As I understand it, we achieve the same thing with these two approaches: the same data becomes available in many remote locations.
Cheers!
Distributed filesystems provide many benefits, like automatic backup or rule-based distribution (you can, say, add many new nodes to your storage and that operation will be transparent to the applications using the storage).
Mounted drives can become a pain one day, when one of the elements in your network goes down for some reason while your applications rely on it.
I need to set up data storage that can hold files at the PB level (the files are mostly small JSON, image, and CSV files, but some of them can be ~100MB binary files).
I am looking into distributed data storage that is masterless, with no single point of failure.
And I found Riak and GlusterFS.
I want to ask: has anyone of you used both of them before?
I know that their interfaces (DB/Map) are very different.
But it seems to me that they both use hashing and similar distributed techniques.
Will they have similar performance, consistency and availability?
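To illustrate what I mean by hashing: both systems place data with a hash-based scheme (Riak on a consistent-hashing ring, GlusterFS via its elastic hashing), something roughly like this toy Python sketch. It is purely illustrative, not either project's actual code.

```python
# Toy consistent-hashing ring (illustrative only). Each key hashes onto a ring
# and is owned by the next node clockwise; virtual nodes smooth the distribution.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Return the node responsible for `key`, wrapping past the ring end."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
for key in ["users/42.json", "images/cat.png", "metrics/2012-06.csv"]:
    print(key, "->", ring.node_for(key))
```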
We are running a 17 node (24GB RAM, 2T disk) Riak cluster with a Bitcask backend, storing around 1 billion 3k objects. This setup is performant but very resource intensive. We are considering moving away from Riak to GlusterFS as performance is not that important for us. Perhaps using LevelDB as a backend would also mitigate our worries.
ATM the self-healing properties of Riak seem stronger and the configuration seems a tad easier. In your case I'd be more comfortable storing the 100MB files on GlusterFS.
Storing larger files like the 100MB files you mention would not be the right choice for plain OSS Riak.
What you really should use in that case is the newly announced RiakCS (http://basho.com/products/riakcs/) from Basho instead.
The choice depends mostly on requirements. Generally, I'd recommend Riak if you do not actually need a real filesystem (with mount points, ACL management and so on) and are just going to use or serve files programmatically, and GlusterFS otherwise.
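For a feel of what "serving files programmatically" means in practice, here is a rough sketch against Riak's HTTP interface. The node address, port, and URL path are assumptions that differ between Riak versions and setups, so check your own cluster's configuration; with GlusterFS the equivalent is a plain file write to the mount point.

```python
# Rough illustration of reading/writing an object over Riak's HTTP interface.
# Hostname, port and path below are hypothetical.
import urllib.request

RIAK_URL = "http://riak-node:8098/riak/images/cat.png"   # hypothetical bucket/key

def put_object(url, data, content_type="application/octet-stream"):
    req = urllib.request.Request(url, data=data, method="PUT",
                                 headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return resp.status

def get_object(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# With GlusterFS the same thing is just a file on a mounted volume:
#   with open("/mnt/gluster/images/cat.png", "wb") as f:
#       f.write(data)
```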
I'm not sure if my question was at all clear or not, so let me just dive in and give you the long version:
Someone said recently, when discussing high-volume web applications, that "disk is the new tape". Website administrators use huge clusters of memcache servers to make disk I/O-free round trips between the client and the server. In order to accomplish this, application developers are having to treat RDBMS's like generic data stores, tossing out valuable features like foreign key constraints, check constraints, and cascading UPDATES and DELETES.
But what if you could put your memcache cluster on the other side of your db interface, outside the realm of the application software (PHP, Ruby, Python...), and inside the RDBMS? Think of it as a large, distributed, in-memory database cache. To make it clear, I'm talking about having the type of memcache cluster that can store pretty much an entire database in memory, guaranteeing 100% cache hit rate when reading from the database, regardless of the parameters of your SELECT statement. Then, when writing application code, you can not only forget about memory caching, you can start dealing with the data store as if it were a Relational Database Management System again, and use normalization and JOINS the way RDBMS's traditionally encourage.
The performance gains, while maintaining/restoring the data integrity benefits of RDBMS's, might be worth looking into.
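To make the idea concrete, here is a rough toy sketch: keep an entire table resident in memory so every read is a guaranteed hit, and write through to the database on updates. This is application-side Python over sqlite3 purely for illustration (the table and column names are made up); the point of the question is that this logic would live inside the RDBMS engine itself.

```python
# Toy write-through cache: the whole table stays resident in memory, so reads
# never touch disk and writes go to the database first. Illustrative only.
import sqlite3

class WriteThroughCache:
    def __init__(self, conn, table, key_col="id"):
        self.conn, self.table, self.key_col = conn, table, key_col
        # Warm the cache with the entire table up front: 100% hit rate afterwards.
        cur = conn.execute(f"SELECT * FROM {table}")
        cols = [d[0] for d in cur.description]
        self.cache = {row[cols.index(key_col)]: dict(zip(cols, row)) for row in cur}

    def get(self, key):
        return self.cache[key]                      # guaranteed hit

    def update(self, key, **values):
        assignments = ", ".join(f"{c} = ?" for c in values)
        self.conn.execute(
            f"UPDATE {self.table} SET {assignments} WHERE {self.key_col} = ?",
            [*values.values(), key],
        )
        self.conn.commit()
        self.cache[key].update(values)              # keep the resident copy in sync

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")
cache = WriteThroughCache(conn, "users")
cache.update(1, name="alice the second")
print(cache.get(1))
```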
Does anyone know of a project where someone has done this already, perhaps as an open source project wherein they might modify an open source RDBMS such as PostgreSQL or MySQL? I haven't found anything yet, and I have no idea how these programs are structured or whether it would even be possible to implement such a storage engine.
You should review the CAP theorem:
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
Keep in mind, if one of those memcache servers rebooted, you would be a bit miffed. Even if you used all memcache servers you would still be limited by the network speed, and then possibly network congestion would become an issue.
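Some back-of-envelope numbers to make the network-speed point concrete (the figures below are rough assumptions, not measurements):

```python
# Rough, assumed figures (orders of magnitude, not measurements) showing why a
# remote in-memory cache is still bounded by the network, not by RAM speed.
local_ram_access_ns = 100        # local memory reference
lan_round_trip_ns   = 500_000    # ~0.5 ms round trip on a typical LAN
gigabit_mb_per_s    = 125        # 1 Gbit/s link ~= 125 MB/s of throughput

print(f"A remote lookup costs ~{lan_round_trip_ns // local_ram_access_ns}x a local RAM access")
print(f"Streaming a 10 GB working set over one link takes ~{10_000 / gigabit_mb_per_s:.0f} s")
```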
There are other devices, such as Fusion-io cards and other pure-RAM hard disks (hypersystem's RAM disks, for example), that also solve this issue. Oooh, I wish my company could afford a dozen of these...
-daniel
The IT department where I work is trying to move to 100% virtualized servers, with all the data stored on a SAN. They haven't done it yet, but the plan eventually calls for moving the existing physical SQL Server machines to virtual servers as well.
A few months ago I attended the Heroes Happen Here launch event, and in one of the SQL Server sessions the speaker mentioned in passing that this is not a good idea for production systems.
So I'm looking for a few things:
What are the specific reasons why this is or is not a good idea? I need references, or don't bother responding. I could come up with a vague "I/O bound" response on my own via google.
The HHH speaker recollection alone probably won't convince our IT department to change their minds. Can anyone point me directly to something more authoritative? And by "directly", I mean something more specific than just a vague Books OnLine comment. Please narrow it down a little.
I can say this from personal experience because I am dealing with this very problem as we speak. The place I am currently working as a contractor has this type of environment for their SQL Server development systems. I am trying to develop a fairly modest B.I. system on this environment and really struggling with the performance issues.
TLB misses and emulated I/O are very slow on a naive virtual machine. If your O/S has paravirtualisation support (which is still not a mature technology on Windows), you can use paravirtualised I/O (essentially a device driver that hooks into an API in the VM). Recent versions of the Opteron have support for nested page tables, which removes the need to emulate the MMU in software (which is really slow).
Thus, applications that run over large data sets and do lots of I/O, like (say) ETL processes, trip over the Achilles heel of virtualisation. If you have anything like a data warehouse system that might be hard on memory or disk I/O, you should consider something else. For a simple transactional application they are probably O.K.
To put this in perspective, the systems I am using are running on blades (an IBM server) on a SAN with 4x 2Gbit F/C links. This is a mid-range SAN. The VM has 4GB of RAM IIRC and now two virtual CPUs. At its best (when the SAN is quiet) this is still only half the speed of my XW9300, which has 5 SCSI disks (system, tempdb, logs, data, data) on 1 U320 bus and 4GB of RAM.
Your mileage may vary, but I'd recommend going with workstation systems like the one I described for developing anything I/O heavy in preference to virtual servers on a SAN.
Unless your resource usage requirements are beyond this sort of kit (in which case they are well beyond a virtual server anyway) this is a much better solution. The hardware is not that expensive - certainly much cheaper than a SAN, blade chassis and VMWare licencing. SQL Server developer edition comes with V.S. Pro and above.
This also has the benefit that your development team is forced to deal with deployment right from the word go - you have to come up with an architecture that's easy to 'one-click' deploy. This is not as hard as it sounds. Redgate SQL Compare Pro is your friend here. Your developers also get a basic working knowledge of database administration.
A quick trip onto HP's website got me a list price of around $4,600 for an XW8600 (their current xeon-based model) with a quad-core xeon chip, 4GB of RAM and 1x146 and 4x73GB 15k SAS hard disks. Street price will probably be somewhat less. Compare this to the price for a SAN, blade chassis and VMware licensing and the cost of backup for that setup. For backup you can provide a network share with backup where people can drop compressed DB backup files as necessary.
EDIT: This whitepaper on AMD's web-site discusses some benchmarks on a VM. From the benchmarks in the back, heavy I/O and MMU workload really clobber VM performance. Their benchmark (to be taken with a grain of salt as it is a vendor supplied statistic) suggests a 3.5x speed penalty on an OLTP benchmark. While this is vendor supplied one should bear in mind:
It benchmarks naive virtualisation and compares it to a para-virtualised solution, not bare-metal performance.
An OLTP benchmark will have a more random-access I/O workload, and will spend more time waiting for disk seeks. A more sequential disk access pattern (characteristic of data warehouse queries) will have a higher penalty, and a memory-heavy operation (SSAS, for example, is a biblical memory hog) that has a large number of TLB misses will also incur additional penalties. This means that the slow-downs on this type of processing would probably be more pronounced than the OLTP benchmark penalty cited in the whitepaper.
What we have seen here is that TLB misses and I/O are very expensive on a VM. A good architecture with paravirtualised drivers and hardware support in the MMU will mitigate some or all of this. However, I believe that Windows Server 2003 does not support paravirtualisation at all, and I'm not sure what level of support is delivered in Windows 2008 server. It has certainly been my experience that a VM will radically slow down a server when working on an ETL process and SSAS cube builds compared to relatively modest spec bare-metal hardware.
SAN - of course, and clustering too. But regarding virtualization: you will take a performance hit (which may or may not be worth it to you):
http://blogs.technet.com/andrew/archive/2008/05/07/virtualized-sql-server.aspx
http://sswug.org has had some notes about it in their daily newsletter lately
I wanted to add this series of articles by Brent Ozar:
Why Your Sysadmin Wants to Virtualize Your Servers
Why Would You Virtualize SQL Server?
Reasons Why You Shouldn't Virtualize SQL Server
It's not exactly authoritative in the sense I was hoping for (coming from the team that builds the server, or an official manual of some kind), but Brent Ozar is pretty well respected and I think he does a great job covering all the issues here.
We are running a payroll system for 900+ people on VMWare with no problems. This has been in production for 10 months. It's a medium sized load as far as DB goes, and we pre-allocated drive space in VM to prevent IO issues. You have to defrag both the VM Host and the VM slice on a regular basis in order to maintain acceptable performance.
Here's some VMware testing on it:
http://www.vmware.com/files/pdf/SQLServerWorkloads.pdf
Granted, they do not compare it to physical machines. But, you could probably do similar testing with the tools they used for your environment.
We currently run SQL Server 2005 in a VMWARE environment. BUT, it is a very lightly loaded database and it is great. Runs with no problems.
As most have pointed out, it will depend on your database load.
Maybe you can convince the IT Department to do some good testing before blindly implementing.
No, I can't point to any specific tests or anything like that, but I can say from experience that putting your production database server on a virtual machine is a bad idea, especially if it has a large load.
It's fine for development. Possibly even testing (on the theory that if it runs fine under load on a virtual box, it's going to run fine in production), but not in production.
It's common sense really. Do you want your hardware running two operating systems and your SQL Server, or one operating system and SQL Server?
Edit:
My experience biased my response. I have worked with large databases under heavy constant load. If you have a smaller database under light load, virtualization may work fine for you.
There is some information concerning this in Conor Cunningham's blog article Database Virtualization - The Dirty Little Secret Nobody is Talking About.... To quote:
Within the server itself, there is surprisingly little knowledge of a lot of things in this area that are important to performance. SQL Server's core engine assumes things like:
all CPUs are equally powerful
all CPUs process instructions at about the same rate.
a flush to disk should probably happen in a bounded amount of time.
And the post goes on to elaborate on these issues somewhat further. I think it's a good read, given the scarcity of information available on this issue in general.
Note there are some specialty virtualization products out there that are made for databases that might be worth looking into instead of a general product like VMWare.
Our company (over 200 SQL servers) is currently in the process of deploying HP Polyserve on some of our servers:
HP PolyServe Software for Microsoft SQL Server enables multiple Microsoft SQL Server instances to be consolidated onto substantially fewer servers and centralized SAN storage. HP PolyServe's unique "shared data" architecture delivers enterprise class availability and virtualization-like flexibility in a utility platform.
Our primary reason for deploying it is to make hardware replacement easier: add the new box to the "matrix", shuffle around where each SQL instance resides (seamlessly), then remove the old box. Transparent to the application teams, because the SQL instance names don't change.
Old Question with Old Answers
The answers in this thread are years old. Most of the negative points in this entire thread are technically still correct but much less relevant. The overhead cost of virtualization and SANs is much less of a factor now than it used to be. A correctly configured virtualization host, guest, network, and SAN can provide good performance with the benefits of virtualization and operational flexibility, including good recovery scenarios that only being virtual can provide.
However, in the real world it only takes one minor configuration detail to bring the whole thing to its knees. In practice your biggest challenge with virtual SQL servers is convincing and working with the people responsible for the virtualization to get it set up just right.
Ironically, in 100 percent of the cases where we took production off of virtualization and moved it back to dedicated hardware, performance went through the roof on the dedicated hardware. In all of these cases it was not the virtualization but the way it was set up. By going back to dedicated hardware we actually proved that the virtualization would have been a much better use of resources, by factors of 5 or more. Modern software is usually designed to scale out across nodes, so virtualization works to your advantage on that front as well.
The biggest concern to me when virtualising software is normally licensing.
Here's an article on it for MS SQL. Not sure about your situation so can't pick out any salient points.
http://www.microsoft.com/sql/howtobuy/virtualization.mspx
SQL Server is supported in a virtual environment. In fact I would recommend it seeing that one of the licensing options is per socket. This means you can put as many SQL Server instances in a virtualized (e.g. Windows 2008 Server Datacenter) system as you like and pay only per processor socket the machine has.
It's better than that because DataCenter is licensed per socket with unlimited Virtual machine licenses as well.
I'd recommend clustering your Hyper-V on two machines however so that if one fails the other can pick up the slack.
I would think that the possibility of something bad happening to the data would be too great.
As a dead simple example, let's say you ran a SQL Server box in Virtual Server 2005 R2 with undo disks turned on (so the main "disk" file stays the same and all changes are made to a separate file which can be purged or merged later). Then something happens (usually you run into the 128GB limit, or whatever the size is) and some middle-of-the-night clueless admin has to reboot and figures out he can't do so until he removes the undo disks. You're screwed - even if he keeps the undo disk files for later analysis, the chances of merging the data back together are pretty slim.
So echoing the other posts in this thread - for development it's great but for production it's not a good idea. Your code can be rebuilt and redeployed (that's another thing, VM's for source control aren't a good idea either) but your live production data is way more important.
Security issues that can be introduced when dealing with virtualization should also be considered. Virtualization Security is a good article by PandaLabs that covers some of the concerns.
You are looking at this from the wrong angle. First, you are not going to find White Papers from Vendors why you should "not" virtualize or why you should Virtualize.
Every environment is different and you need to do what works in your Environment. With that said, there are some servers that are perfect for virtualization and there are some that should not be virtualized. For example, if your SQL Server/s are doing millions and millions of transactions per second, like if your server was located at the NYSE or the NASDAQ and millions and millions of dollars depend on it, you probably should not virtualize it. Make sure you understand the ramifications of virtualizing an SQL server.
I've seen where people virtualize SQL over and over just because Virtualization is cool. Then complain later on when the VM server does not perform as expected.
What you need to do is set a benchmark, fully test the solution you want to deploy, and show what it can and can't do so you don't run into any surprises. Virtualization is great - it is good for the environment and saves money through consolidation - but you need to show your supervisors why you should not virtualize your SQL Servers, and only you can do this.