I'm a developer at heart - but every now and then, a customer doesn't have a decent DBA to deal with these issues, so I'm called in to decide....
What are your strategies / best practices when it comes to dealing with a reasonably sized SQL Server database (anything larger than Northwind or AdventureWorks) - do you use multiple filegroups? If so: how many? And why?
What are your criteria to decide when to move away from the "one filegroup for everything" approach:
database size?
database complexity?
availability / reliability requirements?
what else?
If you use multiple file groups, how many do you use? One for data, one for index, one for log? Several (how many) for data? What are your reasons for your choice - why do you use that exact number of filegroups :-)
The methodology taught in Microsoft training, and generally treated as best practice, is as follows:
Log files are placed on a separate physical drive
Data files are placed on a separate physical drive
Multiple file groups: When a particular table is extremely big. Often the case in transactional database (Separate Physical Drive)
Multiple file groups: When using ranges or when wanting to split lookup data into a read-only database file (Separate Physical Drive)
Keep in mind that an MDF technically works similarly to a hard drive partition when it comes to storing data. The MDF is accessed randomly, whereas the LDF is accessed sequentially. Splitting them onto separate drives therefore gives a large performance gain, and the gain is still there even when running solid state drives.
There's at least ONE good reason for having multiple (at least two) file groups in SQL Server 2008 : if you want to use the FILESTREAM feature, you have to have a dedicated and custom filegroup for your FILESTREAM data :-)
Marc
Maintaining multiple filegroups helps you reduce the I/O burden. It also gives you storage flexibility: you can back up a whole filegroup rather than a single file, and you can place each filegroup on its own disk drive.
Generally you should just have one Primary Filegroup and one log file against that.
Sometimes when you have very static data, you can create a SECOND filegroup that contains this static data. You can then make the filegroup READONLY which improves your performance. After all, this is pretty static data. It's not worth it if you have a low number of readonly rows (eg. lookup table values). But for some stuff (eg. archived content that can still be read in) then this might be a great option.
I got the idea from this blog post.
HTH.
I've worked on a good range of DBs, and the only time we've used filegroups was when a disk was running short on space, and we had to create a new file group on another spindle. I'm sure there are good performance reasons why that's not ideal, but that was the reality.
Among other reasons, additional filegroups make sense if you want to partition a table. That in turn makes sense if there are many competing reads of that table with dissimilar WHERE conditions. You can configure each partition to reflect one such WHERE condition and to be located on a different disk, thereby sending each read to a different disk, which gives parallel reads and less contention.
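As a rough illustration of how a range-partitioned table routes rows, the sketch below mimics RANGE RIGHT partitioning in Python. The boundary values and filegroup names are made up for the example; in SQL Server itself this mapping is declared with a partition function and a partition scheme rather than written by hand.

```python
from bisect import bisect_right

# Hypothetical boundaries and filegroup names, for illustration only.
BOUNDARIES = [20080101, 20090101, 20100101]                # order-date keys
FILEGROUPS = ["FG_OLD", "FG_2008", "FG_2009", "FG_2010"]   # one per partition


def partition_number(value, boundaries=BOUNDARIES):
    """Return the 1-based partition number for value.

    RANGE RIGHT semantics: a value equal to a boundary belongs to the
    partition on the right-hand side of that boundary.
    """
    return bisect_right(boundaries, value) + 1


def filegroup_for(value):
    """Look up which (hypothetical) filegroup would hold the row."""
    return FILEGROUPS[partition_number(value) - 1]
```

A query whose WHERE condition only touches 2009 dates would then be served from FG_2009 alone, which is the point of putting each filegroup on its own disk.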
I need a bit of a help with the following.
Note: in the following scenario, I do not have access to the application's source code, therefore I can only make changes at the database level.
Our database uses dbo.[BLOB] to store all kinds of files and documents. The table uses an IMAGE (yeah, obsolete) data type. Since this particular table is growing quite fast, I was thinking to implement some archiving feature.
My idea is to move all files older than X months to a second database, and then somehow link from the dbo.[BLOB] table to the external/archiving database.
Is this even possible? The goal is to reduce the database size, in order to improve backup and query performance.
Any ideas and hints much appreciated.
Thanks.
Fabian
There are 2 features to help you with backup speed and database size in this case:
FILESTREAM will allow you to store BLOBs as files on the file system instead of in the database file. It complicates the backup scenario, since you have to back up both the database and the files, but you get a smaller database file along with faster access to the documents. It is much faster to read a file from the file system than from a BLOB column. Additionally, FILESTREAM allows for files bigger than 2 GB.
Partitioning will split the table into smaller chunks at the physical level. This way you do not need to touch the application code to change where particular rows are physically stored: you can decide which data needs fast access and put it on an SSD, and which data can land on a slower archive drive. You can then back up the current partition more frequently and the archive less frequently.
Prior to SQL Server 2016 SP1, this feature was available in Enterprise edition only. From SQL Server 2016 SP1 onward it is available in all editions.
In your case most likely you should go with filestream first.
Without modifying the application you can do, basically, nothing. You may try to see if changing the column type will be tolerated by the application (very unlikely, 99.99% it will break the app) and try to use FILESTREAM, but even if you succeed it does not give much benefit (backup size will be the same, for example).
A second thing you can try is to replace the table with a view, using INSTEAD OF triggers for updates. It is still very likely to break the application (let's say 99.98%). The goal would be to have a distributed partitioned view (or cross-DB partitioned view) which presents to the application a unified view of the 'cold' and 'hot' data. It is complex and error prone, but it will reduce the size of the backups (as long as data is moved from hot to cold and cold data is immutable, requiring few backups).
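To make the view idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. It is not SQL Server: the names blob_hot, blob_cold and blob_all are hypothetical, both tables live in one in-memory SQLite database, and a real implementation would be T-SQL with the cold table in a separate database. It only shows the shape of the technique: a UNION ALL view over hot and cold data, an INSTEAD OF trigger that routes writes, and an archiving step.

```python
import sqlite3

SCHEMA = """
CREATE TABLE blob_hot  (id INTEGER PRIMARY KEY, created TEXT, data BLOB);
CREATE TABLE blob_cold (id INTEGER PRIMARY KEY, created TEXT, data BLOB);

-- The application would see only this view.
CREATE VIEW blob_all AS
    SELECT id, created, data FROM blob_hot
    UNION ALL
    SELECT id, created, data FROM blob_cold;

-- New rows written through the view always land in the hot table.
CREATE TRIGGER blob_all_insert INSTEAD OF INSERT ON blob_all
BEGIN
    INSERT INTO blob_hot (id, created, data)
    VALUES (NEW.id, NEW.created, NEW.data);
END;
"""


def open_demo():
    """Create the in-memory stand-in database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def archive_older_than(conn, cutoff_iso_date):
    """Move rows created before the cutoff from hot to cold in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO blob_cold (id, created, data) "
            "SELECT id, created, data FROM blob_hot WHERE created < ?",
            (cutoff_iso_date,),
        )
        conn.execute("DELETE FROM blob_hot WHERE created < ?", (cutoff_iso_date,))
```

Dates are stored as ISO strings so that plain string comparison orders them correctly. Updates and deletes through the view would need their own INSTEAD OF triggers, which is part of why this approach is error prone.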
The goal is to reduce the database size, in order to improve backup and query performance.
To reduce the backup size, as I explained above, you can do, basically, nothing. Performance, however, you need to investigate and address appropriately, based on your findings. Saying that the database is slow 'because of BLOBs' is hand-waving.
This is most likely a ridiculous question, but I'm intrigued by the thought so I'll ask anyway. Is there any performance or other benefit (outside of disaster recovery handling) to having a database on multiple filegroups stored on the same physical drive?
More specifically, if I create a secondary filegroup ONLY for full-text indexes on the same physical drive, is it beneficial? Could it be a bottleneck?
Log files in my situation are stored on a separate physical drive from data files.
It shouldn't provide any additional benefit, except that with separate file groups you could, potentially, split out your backups. As far as the I/O on the same drive, you won't gain much if anything by doing this, so if you're considering it strictly for an I/O performance reason, I would suggest holding off until you can budget separate spindles.
Multiple files have the benefit of reducing allocation contention (PFS latch contention). Really really really fast IO subsystems (eg. SSD drives) can expose this problem and require mitigation by adding more files to the database. There are more details on this at How many files should a database have? or on Benchmarking: Multiple data files on SSDs.
Multiple filegroups imply multiple files, but at the same time a hot table will not benefit from multiple filegroups because the hot spot will be, again, in a single filegroup (unless, of course, the hot filegroup is itself split into multiple files). So I would say that filegroups are to be used solely for administration purposes (e.g. piecemeal restore).
No, it does not matter. All it does is make it easier to move them later if you want to.
Which is the better practice for storing files? Storing the file directly in the database, or storing just the location of the file?
Avoid storing files in your database. Most don't deal with them well.
It depends. You need to consider several things.
If you have a mickey mouse freeware database, meaning that it does not handle blobs appropriately (reads the blobs on every SELECT; does not store the blobs in a separate physical structure to the row; very slow with blobs; etc):
keep the files outside, store only the location
manually deal with the syncing of row.location vs the file system
If you have an enterprise SQL Platform, it is no problem at all to keep the blobs inside the database. In fact, retrieval is faster. These do not read the blobs on every SELECT, they are stored in a separate physical structure to the rows. The one extra read to get the blob if the SELECT requests it, is not a "performance problem".
The PAGESIZE in genuine SQL databases can be set as 2k; 4k; 8k; or 16k.
2k is perfect for OLTP (small rows, small Transactions: you do not want to move 8K on every IO operation)
larger sizes are relevant based on how much OLAP you cater for
in your case, the average size of the files
there will be some waste in the unused portion of the last page, per row/blob.
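The waste in the last page is easy to estimate. The sketch below is a simplification that assumes the whole page is usable for blob data and ignores page headers and any per-row overhead, which real engines do have; the function names are just for illustration.

```python
def pages_needed(blob_bytes, page_bytes):
    """Number of whole pages required to hold a blob (ceiling division)."""
    return -(-blob_bytes // page_bytes)


def wasted_bytes(blob_bytes, page_bytes):
    """Unused bytes in the last page, assuming the full page holds blob data."""
    return pages_needed(blob_bytes, page_bytes) * page_bytes - blob_bytes
```

For a 5,000-byte file this gives 1,144 wasted bytes with 2K pages and 3,192 with 8K pages, so the average file size matters when choosing a page size.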
The disadvantage of keeping the blobs in the database is, your database backups will be significantly larger.
Some enterprise databases (e.g. SAP/Sybase) recognise that a page has not changed, and exclude it from the incremental backups
others have no incremental database backups.
The advantage of keeping the blobs in the database is:
data and referential integrity. You will not have the problem of having the rows that are out of synch with the blobs
the blobs are included in the backup: otherwise, upon a restore, the task of syncing the restored database with the restored files is a major problem.
I completed an assignment last year, where the customer had 130GB of data in the db, and 700GB of documents stored outside the db. After ten years of problems, they bit the bullet, and moved the documents into the db.
Guess what, what was supposed to be a simple job (long but simple, because the references were supposed to be absolutely correct), ended up being massive, because there were so many (a) duplicates, and (b) invalid references.
The resulting database was 630GB, there were 100GB of dupes. 2K pagesize.
Responses to Comments
Slash or Backslash
Easy.
In the database, store slash only.
You need a way of identifying the target system, and an IsWindoze indicator. It should be higher up in the table hierarchy, not at the level where the Filename is located.
Whenever you report or display the Filename column, if IsWindoze, change the slashes to backslashes.
You will have a similar problem with the DriveLetter and colon D:, which Unix does not have. Allow it only if IsWindoze.
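The display rule above could look roughly like this. The function and parameter names are hypothetical; the only things taken from the answer are that the stored value uses slashes only, and that backslashes and the drive letter appear only when the target system is Windows.

```python
def display_filename(stored_path, is_windows, drive_letter=None):
    """Format a slash-only stored path for the target system.

    stored_path  -- path as kept in the database, using "/" only
    is_windows   -- the per-target-system indicator
    drive_letter -- optional, honoured only when is_windows is true
    """
    if not is_windows:
        # Unix has no drive letters, so any drive letter is ignored.
        return stored_path
    path = stored_path.replace("/", "\\")
    if drive_letter:
        path = f"{drive_letter}:{path}"
    return path
```

For example, display_filename("/docs/2009/report.pdf", True, "D") yields D:\docs\2009\report.pdf, while the same call with is_windows set to False returns the stored value unchanged.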
Late answer: it depends on your engine.
A page size of 2k hasn't been used since the 1990s for SQL Server. Oracle defaults to 8K, SQL Server is 8K. Only Sybase AFAIK is still in the last century.
SQL Server now offers FILESTREAM which combines the best of both worlds, as Oracle has done for longer with BFILE
SQL Server and Oracle offer on disk and backup compression
I'm sure PostgreSQL at least offers similar features.
Note: this is mainly to offer alternatives to PerformanceDBA's FUD
The preferred method is to store the file in the filesystem and store the location of the file in the database. The reasoning for this has to do with how databases physically allocate space on disk (usually in 8k or 16k chunks). Dropping large files in there causes your database to use different mechanisms to store the files (SQL Server calls this row overflow data). Typically these kinds of pages are located outside the normal table pages, so every logical read for a row results in two physical reads on disk. Needless to say, this isn't good for performance.
I have a design decision to make regarding documents uploaded to my web site: I can either store them on my file server somewhere, or I can store them as a blob in my database (MSSQL 2005). If it makes any difference to the design decision, these documents are confidential and must have a certain degree of protection.
The considerations I've thought of are:
Storing on the file server makes for HUUUUUUUGE numbers of files all dumped in a single directory, and therefore slower access, unless I can work out a reasonable semantic definition for a directory tree structure
OTOH, I'm guessing that the file server can handle compression somewhat better than the DB... or am I wrong?
My instincts tell me that the DB's security is stronger than the file server's, but I'm not sure if that's necessarily true.
Don't know how having terabytes of blobs in my DB will affect performance.
I'd very much appreciate some recommendations here. Thanks!
In SQL Server 2005, your only choices are to use VARBINARY(MAX) to store the files inside the database table, or to keep them outside the database.
The obvious drawback of leaving them outside the database is that the database can't really control what happens to them; they could be moved, renamed, deleted.....
SQL Server 2008 introduces the FILESTREAM attribute on VARBINARY(MAX) types, which allows you to leave the files outside the database table, but still under transactional control of the database - e.g. you cannot just delete the files from the disk; the files are an integral part of the database and thus get copied and backed up with it. Great if you need it, but it could make for some huge backups! :-)
The SQL Server 2008 launch presented some "best practices" as to when to store stuff in the database directly, and when to use FILESTREAM. These are:
if the files are typically less than 256 KB in size, the database table is the best option
if the files are typically over 1 MB in size, or could be more than 2 GB in size, then FILESTREAM (or in your case: plain old filesystem) is your best choice
no recommendation for files between those two margins
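Written out as a small helper, those thresholds look like this. The function name and return strings are made up for the example; the 256 KB and 1 MB cut-offs are the ones quoted above.

```python
KB = 1024
MB = 1024 * KB


def suggested_storage(typical_size_bytes):
    """Map a typical file size to the storage option suggested above."""
    if typical_size_bytes < 256 * KB:
        return "table"              # store in the database table
    if typical_size_bytes > 1 * MB:
        return "filestream"         # or, on SQL Server 2005, the plain filesystem
    return "no recommendation"      # between 256 KB and 1 MB
```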
Also, in order not to negatively impact the performance of your queries, it's often a good idea to put the large files into a separate table altogether. Don't make the huge blobs part of the regular tables you query; rather, create a separate table that you only ever query when you really need the megabytes of documents or images.
So that might give you an idea of where to start out from!
I strongly suggest you consider the filesystem solution. The reasons are:
you have better access to the files (precious in case of debugging), meaning that you can use regular console-based tools
you can quickly and easily take advantage of the OS to distribute the load, for example using a distributed filesystem, add redundancy via a hardware RAID etc.
you can take advantage of the OS access control lists to enforce permissions.
you don't clog your database
If you are worried about large numbers of entries in your directories, you can always create a branching scheme, for example:
filename : hello.txt
filename md5: 2e54144ba487ae25d03a3caba233da71
final filesystem position: /path/2e/54/hello.txt
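A minimal sketch of such a branching scheme follows. The function name, the default root and the choice of two levels of two hex characters are assumptions that mirror the example above; MD5 is used here only to spread files evenly across directories, not for security.

```python
import hashlib
from pathlib import PurePosixPath


def branched_path(filename, root="/path", levels=2, width=2):
    """Build a storage path whose directories come from the filename's MD5.

    With the defaults, the first two hex characters of the digest name the
    first directory and the next two name the second.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return str(PurePosixPath(root, *parts, filename))
```

With two levels of two hex characters there are 256 * 256 = 65,536 leaf directories, so even millions of files leave only a modest number of entries in each one.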
There's a LOT of "it depends" behind this popular subject. Since you say the documents are sensitive and confidential, off the cuff I'd go with storing in the database. Here are a few reasons:
Potentially better security. It is often easier to hack a file system than a database.
Better volume control. Thousands of files in one folder can strain an OS, where a database can take millions of rows in one table without blinking.
Better searching and scanning. Add categorizing columns when you load the data, or try out full text indexing to scan the actual documents.
Backups may be more efficient -- just add another database to your backup plan, and you're covered (once you work out space details, of course). And those backup files are another layer of obfuscation on anyone trying to get at your sensitive documents.
SQL Server 2008 has data compression options that may help here. That, or have the application do it? (More security through obfuscation, perhaps)
SQL Server 2008 also has the FILESTREAM attribute for VARBINARY(MAX) columns, which may help here, but I'm not familiar enough with it to give a recommendation for your situation.
I'm creating a new DB and have a bunch of static data that won't change. If it does, it will be a manual process AND it will happen very rarely.
This data is a mix of varchars and Geographies.
I'm guessing it could be around 100K or so in total, over 4 or so tables.
Questions
Should I put these on a READ ONLY filegroup?
Can I create the tables in the designer and define the filegroup during creation? Or is it only possible via a script?
Once the data is in the table (on a read only filegroup), can I change it later? Is it really hard to do that?
thanks.
It is worth it for VLDB (very large databases) for assorted reasons.
For 100,000 rows or 100 KB, I wouldn't bother.
This SQL Server support engineering team article discusses one of the associated "urban legends".
There is another one (can't find it) where you need 300 GB - 1 TB of data before you should consider multiple files/filegroups.
But, to answer specifically
Personal choice (there is no hard and fast rule)
Yes (edit:) In SSMS 2005, design mode, go to Indexes/Keys, "data space specification". The data lives where the clustered index is. Without a clustered index, you can only do it via CREATE TABLE (..) ON filegroup
Yes, but you'll have to ALTER DATABASE myDB MODIFY FILEGROUP foo READ_WRITE with the database in single user exclusive mode
It is unlikely to hurt to put the data into a read-only space, but I am unsure you will gain significantly. A read-only filegroup (or tablespace in Oracle) can give you 2 advantages: less to back up each time a full backup is taken, and a higher level of security over the data (e.g. it cannot be changed by a bug, by accessing the DB via another tool, etc). The backup advantage matters most with larger DBs where backup windows are tight, so putting a small amount of effort into excluding filegroups is valuable. The security one depends on the nature of the site, data, etc. (If you do exclude the read-only space from regular backups, make sure you get a copy on any retained backup tapes. I tend to back up read-only spaces once a month.)
I am not familiar with designer.
Changing to and from read only is not onerous.
I think anything you read here is likely to be speculation, unless you have any evidence that it's been actually tried and recommended - to me it looks like a novel but unlikely idea. Do you have some reason to suspect that conventional practices will be unsatisfactory? It should be fairly easy to just try it and find out. Post your results if you get a chance.