Storing binary files in SQL Server - sql-server

I'm writing an MVC/SQL Server application that needs to associate documents (Word, PDF, Excel, etc.) with records in the database (supporting SQL Server 2005). The consensus is that it's best to keep the files in the file system and only save a path/reference to the file in the database. However, in my scenario, an audit trail is extremely important. We already have a framework in place to record audit information whenever a change is made in the system, so it would be nice to use the database to store documents as well. If the documents were stored in their own table with a FK to the related record, would performance become an issue? I'm aware of the potential problems with backups/restores, but would db performance start to degrade at some point if the document tables became very large? If it makes any difference, I would never expect this system to need to service anywhere near 100 concurrent requests, maybe tens of requests.

Storing the files as BLOBs in the database will increase the size of the db and will definitely affect backups, as you already know.
It also matters whether the db and the application code run on the same server.
If they don't, the code server has to request the data from the db server and then pass it on to the client, so every file is transferred twice.
If the files are very large, I would say go for the file system and save the file paths in the db.
Otherwise you can keep the files as BLOBs in the db; they will be more secure there, and safer from viruses and the like.
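To make the "own table with a FK" layout from the question concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table and column names (record, document, content) are invented for illustration, so treat it as a shape to adapt rather than a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE record (
        record_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL
    );
    CREATE TABLE document (
        document_id INTEGER PRIMARY KEY,
        record_id   INTEGER NOT NULL REFERENCES record(record_id),
        file_name   TEXT NOT NULL,
        content     BLOB NOT NULL
    );
""")

conn.execute("INSERT INTO record (record_id, title) VALUES (1, 'Contract 42')")
conn.execute(
    "INSERT INTO document (record_id, file_name, content) VALUES (?, ?, ?)",
    (1, "contract.pdf", b"%PDF-1.4 placeholder bytes"),
)

# Everyday queries list documents by metadata only and never select the BLOB column.
listing = conn.execute(
    "SELECT document_id, file_name FROM document WHERE record_id = ?", (1,)
).fetchall()

# The bytes are fetched by primary key only when a user actually opens the file.
content = conn.execute(
    "SELECT content FROM document WHERE document_id = ?", (listing[0][0],)
).fetchone()[0]
```

Keeping the BLOB column out of the routine queries is what keeps the document table from slowing down the rest of the application, whatever the engine.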

Related

Stealing my information back

TL;DR: My POS uses Sybase Advantage Database Server to store my sales data, and I'd like to access it, but I only have the backup files.
I own a small business with "advanced" POS software, which has the only copy of all my sales data ever. They have some backup scheme, but they're unwilling to divulge any details. There's also an automatic daily local backup routine, but because this is a POS and there are certain laws about deleting data, I am not allowed (nor do I have the software required) to restore from backup even to check that it works. I asked the support guy when the last time he had to restore from backup was, and he said "don't worry, we don't ever need it".
Naturally, I'm worried.
I'll note at this point that I am required by law to keep this data, and should I fail to do so for any reason I may personally face massive fines in the range of multiple millions. I'd like to avoid that.
In addition to keeping the data, and verifying that the backups contain the data I need to keep, I'd also like to create reports. The POS vendor claims that it can create any report I'd ever need, but every single time I've asked them about a report it either contained wrong data, crashed, exported unreadable files (to which their reply was that the files are fine, my [insert relevant file reader] is broken), or simply didn't exist (to which their reply is usually something like "you don't need that report anyway"). I asked about accessing a copy of the database myself, and they said they can't allow that. My only recourse is to pay them tens of thousands for developing and testing the report. What report do I want, you ask?
SELECT * FROM SALES
To create this simple report, I need to migrate my data from the Sybase Advantage Database Server backup files into a format I can use, e.g. a MySQL database, but all the migration tools I've found require access to a working database server.
How can I get my data out of these backups?
It's actually a complete Sybase iAnywhere database (which uses .db for the data and .log for the transaction log).
So you should be looking for Sybase iAnywhere or Sybase SQLAnywhere drivers and tools.
SAP / Sybase has a developer website here: http://scn.sap.com/community/sql-anywhere

Merge multiple Access database into one big database

I have multiple ~50MB Access 2000-2003 databases (MDB files) that only contain tables with data. The data-databases are located on a server in my enterprise that can take ~1-2 seconds to respond (and about 10 seconds to actually open a 50MB MDB file manually while browsing in the file explorer). I have other databases that only contain forms. Most of those forms-databases (still MDB files) are actually copied from the server to the client with a batch file before execution (after some testing, the execution looks smoother that way). Most of those forms-databases use table-links to fetch the data from the data-databases.
Now, my question is: is there any advantage or disadvantage to merging all the data-databases from my ~50MB databases into one big database (let's say 500MB)? Will it be slower? It would actually help clean up my code if I didn't have to connect to all those different databases, and I don't think 500MB is a lot, but I don't claim to be very experienced with Access, which is why I'm asking. If Access needs to read the whole MDB file to get the data from a specific table, then it would be slower. That wouldn't be all that surprising from Microsoft, but I've been pleased so far with MS Access database performance.
There will never be more than ~50 people connected to the database at the same time (most likely, this number won't in fact be more than 10, but I prefer being a little bit conservative here just to be sure).
The db engine does not read the entire MDB file to get information from a specific table. It must read information from the system tables (hidden tables whose names start with MSys) to determine where the data you need is stored. Furthermore, if you're using a query to retrieve information from the table, and the db engine can use an index to determine which rows satisfy the query's WHERE clause, it may read only those rows from the table.
However, you have issues with your network's performance. When those lead to dropped connections, you risk corrupting the MDB. That is why Access is not well suited for use in wide area networks or with wireless connections. And even on a wired LAN, you can suffer such problems when the network is flaky.
So while reducing the amount of data you pull across the network is a good thing, it is not the best remedy for Access on a flaky network. Instead you should migrate the data to a client-server db so it can be kept safe in spite of dropped connections.
You are walking on thin ice here.
Access will handle your scenario, but is not really meant to allow so many concurrent connections.
Merging everything in a big database (500mb) is not a wise move.
Have you tried to open it from a network location?
My suggestion is to use a SQL Server Express back end and merge all the tables into a single, real client-server database.
The changes required by client mdb front-end should not be very pervasive.

Should I store file in database or just the location to that file?

Which is the better practice for storing a file: storing the file directly in the database, or storing just the location of that file?
Avoid storing files in your database. Most don't deal with them well.
It depends. You need to consider several things.
If you have a mickey mouse freeware database, meaning one that does not handle blobs appropriately (reads the blobs on every SELECT; does not store the blobs in a separate physical structure from the row; is very slow with blobs; etc.):
keep the files outside, store only the location
manually deal with the syncing of row.location vs the file system
If you have an enterprise SQL Platform, it is no problem at all to keep the blobs inside the database. In fact, retrieval is faster. These do not read the blobs on every SELECT, they are stored in a separate physical structure to the rows. The one extra read to get the blob if the SELECT requests it, is not a "performance problem".
The PAGESIZE in genuine SQL databases can be set as 2k; 4k; 8k; or 16k.
2k is perfect for OLTP (small rows, small Transactions: you do not want to move 8K on every IO operation)
larger sizes are relevant based on how much OLAP you cater for
in your case, the average size of the files
there will be some waste in the unused portion of the last page, per row/blob.
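To put rough numbers on the "waste in the last page" point, here is a back-of-the-envelope sketch. It assumes every byte of a page is usable for blob data and that a blob never shares a page with another blob; real engines reserve space for page headers and pointers, so actual figures will differ a bit.

```python
def pages_and_waste(blob_bytes, page_bytes):
    """Return (pages needed, unused bytes in the last page) for one blob."""
    pages = -(-blob_bytes // page_bytes)  # ceiling division
    return pages, pages * page_bytes - blob_bytes

# A hypothetical 100,000-byte file at each of the page sizes mentioned above.
for page_kb in (2, 4, 8, 16):
    pages, waste = pages_and_waste(100_000, page_kb * 1024)
    print(f"{page_kb:>2}K pages: {pages} pages, {waste} bytes unused")
```

The unused tail grows with the page size, which is why the average file size matters when picking one.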
The disadvantage of keeping the blobs in the database is, your database backups will be significantly larger.
Some enterprise databases (eg. SAP/Sybase) recognise that a page has not changed, and exclude it from the incremental backups
others have no incremental database backups.
The advantage of keeping the blobs in the database is:
data and referential integrity. You will not have the problem of having the rows that are out of synch with the blobs
the blobs are included in the backup: otherwise, upon a restore, the task of syncing the restored database with the restored files is a major problem.
I completed an assignment last year, where the customer had 130GB of data in the db, and 700GB of documents stored outside the db. After ten years of problems, they bit the bullet, and moved the documents into the db.
Guess what, what was supposed to be a simple job (long but simple, because the references were supposed to be absolutely correct), ended up being massive, because there were so many (a) duplicates, and (b) invalid references.
The resulting database was 630GB, there were 100GB of dupes. 2K pagesize.
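For anyone facing a similar migration, the two problems named above (duplicates and invalid references) can be surveyed before moving anything. The following is only a sketch of one way to do that, not what was used on that assignment: it takes the list of paths stored in the database, reports the ones that no longer exist, and groups byte-identical files by SHA-256 digest. The function name is made up, and it reads each file fully into memory, which you would want to change for very large files.

```python
import hashlib
from pathlib import Path

def audit_references(paths):
    """Split referenced paths into missing files and groups of identical files."""
    missing = []
    by_digest = {}
    for p in paths:
        path = Path(p)
        if not path.is_file():
            missing.append(p)  # invalid reference: the row points at nothing
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest.setdefault(digest, []).append(p)
    # Any digest seen more than once is a set of duplicate documents.
    duplicates = [group for group in by_digest.values() if len(group) > 1]
    return missing, duplicates
```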
Responses to Comments
Slash or Backslash
Easy.
In the database, store slash only.
You need a way of identifying the target system, and an IsWindoze indicator. It should be higher up in the table hierarchy, not at the level where the Filename is located.
Whenever you report or display the Filename column, if IsWindoze, change the slashes to backslashes.
You will have a similar problem with the DriveLetter and colon D:, which Unix does not have. Allow it only if IsWindoze.
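The slash rule above is small enough to sketch in a few lines. The function names are invented for this example, and is_windows plays the role of the IsWindoze indicator, which is assumed to be looked up from wherever the target system is recorded.

```python
import re

DRIVE_PREFIX = re.compile(r"^[A-Za-z]:")

def to_stored(path):
    """Normalise a path for storage: forward slashes only."""
    return path.replace("\\", "/")

def check_stored(stored, is_windows):
    """Reject a drive letter unless the target system is Windows."""
    if DRIVE_PREFIX.match(stored) and not is_windows:
        raise ValueError(f"drive letter not allowed on a non-Windows target: {stored!r}")

def to_display(stored, is_windows):
    """Convert the stored form to the target system's separator for display."""
    check_stored(stored, is_windows)
    return stored.replace("/", "\\") if is_windows else stored
```

For example, to_stored(r"D:\docs\a.pdf") gives "D:/docs/a.pdf", and to_display of that value with is_windows=True gives the backslash form back.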
Late answer: it depends on your engine.
A page size of 2k hasn't been used since the 1990s for SQL Server. Oracle defaults to 8K, SQL Server is 8K. Only Sybase AFAIK is still in the last century.
SQL Server now offers FILESTREAM which combines the best of both worlds, as Oracle has done for longer with BFILE
SQL Server and Oracle offer on disk and backup compression
I'm sure PostgreSQL at least offers similar features.
Note: this is mainly to offer alternatives to PerformanceDBA's FUD
The preferred method is to store the file in the filesystem and store the location of the file in the database. The reasoning for this has to do with how databases physically allocate space on disk (usually in 8k or 16k chunks). Dropping large files in there causes your database to use different mechanisms to store the files (SQL Server calls this row overflow data). Typically these kinds of pages are located outside the normal table, so every logical read for a row results in two physical reads on disk. Needless to say, this isn't good for performance.

Best strategy for storing documents in SQL Server 2008

One of our teams is going to be developing an application to store records in a SQL2008 database and each of these records will have an associated PDF file. There is currently about 340GB of files, with most (70%) being about 100K, but some are several Megabytes in size. Data is mostly inserted and read, but the files are updated on occasion. We are debating between the following options:
1. Store the files as BLOBs in the database.
2. Store the files outside the database and store the paths in the database.
3. Use SQL2008's Filestream feature to store the files.
We have read the Microsoft best practices regarding filestream data, but since the files vary in size, we are not sure which path to choose. We are leaning toward option 3 (filestream), but have some questions:
1: Which architecture would you choose given the amount of data and file sizes noted above?
2: Data access will be done using SQL authentication, not Windows authentication, and the web server will likely not be able to access the files using the Windows API. Would this make filestream perform worse than the other two options?
3: Since the SQL backups include the filestream data, this would lead to very large database backups. How do others handle backing up databases with a large amount of filestream data?
OK, here we go. Option 2 is a really bad idea: you end up with untestable integrity constraints and backups that are by definition not guaranteed to be consistent, because you cannot take point-in-time backups. That is not a problem in MOST scenarios, but it turns into one the moment you need a more complicated (point-in-time) recovery.
Options 1 and 3 are pretty equal, albeit with some implications.
Filestream can use a lot more disk space. Basically, every version has a GUID, and if you make updates the old files stay around until the next backup.
OTOH the files do not count as db size (with Express edition they do not count against the 10GB limit, should you use it), and access is additionally possible through a file share. This is added flexibility.
In-database storage has the most limited options regarding access (there is no way for the web server to just open the file after getting the path from SQL; it has to funnel the complete file through the SQL protocol layer), but it has the advantage of leaving you with fewer files to manage. Putting the blobs into a separate table, and putting that table on a separate set of spindles, may be strategically a good idea.
Regarding your questions:
1: I would go with in database storage. Try out both - filestream and not. As you use the same API anyway, this is a simple change in the table definition.
2: Yes, worse than direct file access, but it would be more protected than direct file access. Otherwise I do not think filestream and blob make a significant difference.
3: where do you have a huge backup here? Sorry to ask, but your 340gb is not exactly a large database. And you need to back it up ANYWAY. Better do it in one consistent state, which is what you achieve with db storage. Plus integrity (no one accidentally deleting unused documents without cleaning up the database). The DB is not significantly larger than doing that split, and it is a simple one place backup.
In the end, the question is db integrity and ease of backing things up. That is a win for SQL Server unless you get large, and this means 360 terabytes of data.
Store the files outside the database and store the paths in the database, because it takes too much space to store files in the database.
I would definitely recommend (3) - this is the sort of scenario that this feature is specifically built to handle, and it is handled very well in my opinion.
This white paper has lots of useful information - http://msdn.microsoft.com/en-us/library/cc949109(SQL.100).aspx - and from a security point of view mentions that...
There are two security requirements for using the FILESTREAM feature. Firstly, SQL Server must be configured for integrated security. Secondly, if remote access will be used, then the SMB port (445) must be enabled through any firewall systems.
With regard to Backups, see the accepted answer to this question - SQL Server FILESTREAM limitation
I've used an Index/Content method that you haven't listed, but it might help. You have a table of files that are stored as binary blobs, each with a unique ID or row number. A second SQL table provides the index: the name of the file, the path to it, keywords, file type, file size, checksum... whatever you need. This is the best approach I have seen for working with thousands of uploaded documents. The index is required to view the file, as it would just be binary data to the user if they have no idea what the file type is. We store the data in 2 separate databases so that the index can live on one server and the file store on multiple servers for easy expansion. At that point the index table/database contains the name or key of the server the file is on. If the user has access to read that particular index table, then they have access to the file.
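A rough sketch of that Index/Content split, for illustration only: in-memory SQLite databases stand in for the index server and the content servers, and every name here (file_index, file_content, store_key, the store names) is made up rather than taken from the system described above.

```python
import hashlib
import sqlite3

# One index database plus any number of content stores, keyed by a server name.
index_db = sqlite3.connect(":memory:")
stores = {
    "store-a": sqlite3.connect(":memory:"),
    "store-b": sqlite3.connect(":memory:"),
}

index_db.execute("""
    CREATE TABLE file_index (
        file_id    INTEGER PRIMARY KEY,
        file_name  TEXT NOT NULL,
        file_type  TEXT NOT NULL,
        file_size  INTEGER NOT NULL,
        checksum   TEXT NOT NULL,
        store_key  TEXT NOT NULL,    -- which content store holds the bytes
        content_id INTEGER NOT NULL  -- row id inside that store
    )
""")
for store in stores.values():
    store.execute(
        "CREATE TABLE file_content (content_id INTEGER PRIMARY KEY, data BLOB NOT NULL)"
    )

def save_file(name, file_type, data, store_key):
    """Write the bytes to one content store and describe them in the index."""
    cur = stores[store_key].execute(
        "INSERT INTO file_content (data) VALUES (?)", (data,)
    )
    index_cur = index_db.execute(
        "INSERT INTO file_index (file_name, file_type, file_size, checksum,"
        " store_key, content_id) VALUES (?, ?, ?, ?, ?, ?)",
        (name, file_type, len(data), hashlib.sha256(data).hexdigest(),
         store_key, cur.lastrowid),
    )
    return index_cur.lastrowid

def load_file(file_id):
    """Resolve the index row, fetch the bytes, and verify the checksum."""
    name, file_type, checksum, store_key, content_id = index_db.execute(
        "SELECT file_name, file_type, checksum, store_key, content_id"
        " FROM file_index WHERE file_id = ?", (file_id,)
    ).fetchone()
    data = stores[store_key].execute(
        "SELECT data FROM file_content WHERE content_id = ?", (content_id,)
    ).fetchone()[0]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise ValueError("stored content does not match the indexed checksum")
    return name, file_type, data
```

Adding capacity then means adding another entry to the set of stores; existing index rows keep pointing at the store they were written to.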
This scenario is easy: the FILESTREAM recommendation says it is best when the files are (on average) larger than 1MB, which is not your case. For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.
Since you will be accessing the files directly from SQL Server and not from the filesystem, you should store them as BLOBs.
Read When to Use FILESTREAM: http://technet.microsoft.com/en-us/library/bb933993%28v=sql.105%29.aspx
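If you want to check your own file set against that 1MB rule of thumb, a tiny helper is enough. The function name and the sample mix below are hypothetical (loosely modelled on "mostly about 100K, some several megabytes"), and the average is only one input; access pattern matters too.

```python
ONE_MB = 1024 * 1024  # threshold from the FILESTREAM guidance quoted above

def suggest_storage(file_sizes_bytes):
    """Return 'FILESTREAM' when the average file exceeds 1 MB, else 'varbinary(max)'."""
    average = sum(file_sizes_bytes) / len(file_sizes_bytes)
    return "FILESTREAM" if average > ONE_MB else "varbinary(max)"

# Hypothetical mix: 70 files of about 100 KB and 30 files of 2 MB.
sample = [100 * 1024] * 70 + [2 * ONE_MB] * 30
print(suggest_storage(sample))
```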
Have you looked at the RBS (Remote Blob Storage) solution? If you use the Filestream RBS provider, it will internally keep your blobs as Filestream files or varbinary(max) values, depending on which gives better performance for the blob size.
Remote BLOB Store Provider Library Implementation Specification
SQL Remote Blob Storage Team Blog

Using SQL Server as Image store

Is SQL Server 2008 a good option to use as an image store for an e-commerce website? It would be used to store product images of various sizes and angles. A web server would output those images, reading the table by a clustered ID. The total image size would be around 10 GB, but will need to scale. I see a lot of benefits over using the file system, but I am worried that SQL server, not having an O(1) lookup, is not the best solution, given that the site has a lot of traffic. Would that even be a bottle-neck? What are some thoughts, or perhaps other options?
10 GB is not a huge amount of data, so you can probably use the database to store it without any big issues. Of course, performance-wise it's best to use the filesystem, while for safety and management it's better to use the DB (backups and consistency).
Happily, SQL Server 2008 allows you to have your cake and eat it too, with:
The FILESTREAM Attribute
In SQL Server 2008, you can apply the FILESTREAM attribute to a varbinary column, and SQL Server then stores the data for that column on the local NTFS file system. Storing the data on the file system brings two key benefits:
Performance matches the streaming performance of the file system.
BLOB size is limited only by the file system volume size.
However, the column can be managed just like any other BLOB column in SQL Server, so administrators can use the manageability and security capabilities of SQL Server to integrate BLOB data management with the rest of the data in the relational database—without needing to manage the file system data separately.
Defining the data as a FILESTREAM column in SQL Server also ensures data-level consistency between the relational data in the database and the unstructured data that is physically stored on the file system. A FILESTREAM column behaves exactly the same as a BLOB column, which means full integration of maintenance operations such as backup and restore, complete integration with the SQL Server security model, and full-transaction support.
Application developers can work with FILESTREAM data through one of two programming models; they can use Transact-SQL to access and manipulate the data just like standard BLOB columns, or they can use the Win32 streaming APIs with Transact-SQL transactional semantics to ensure consistency, which means that they can use standard Win32 read/write calls to FILESTREAM BLOBs as they would if interacting with files on the file system.
In SQL Server 2008, FILESTREAM columns can only store data on local disk volumes, and some features such as transparent encryption and table-valued parameters are not supported for FILESTREAM columns. Additionally, you cannot use tables that contain FILESTREAM columns in database snapshots or database mirroring sessions, although log shipping is supported.
Check out this white paper from MS Research (http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2006-45)
They detail exactly what you're looking for. The short version is that any file size over 1 MB starts to degrade performance compared to saving the data on the file system.
I doubt that O(log n) for lookups would be a problem. You say you have 10GB of images. Assuming an average image size of say 50KB, that's 200,000 images. Doing an indexed lookup in a table for 200K rows is not a problem. It would be small compared to the time needed to actually read the image from disk and transfer it through your app and to the client.
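The arithmetic behind that estimate, spelled out. The fan-out of 100 keys per index page is an assumption picked for illustration, not a SQL Server figure; real index pages often hold more, which only makes the tree shallower.

```python
import math

total_bytes = 10 * 10**9      # "10GB of images", in decimal units
avg_image_bytes = 50 * 10**3  # assumed average image size of 50KB
images = total_bytes // avg_image_bytes

# Worst-case comparisons for a plain binary search over that many keys.
binary_steps = math.ceil(math.log2(images))

# Levels in a B-tree index, assuming roughly 100 keys per page.
ASSUMED_FANOUT = 100
btree_levels = math.ceil(math.log(images, ASSUMED_FANOUT))

print(images, binary_steps, btree_levels)
```

A handful of page reads per lookup, most of them served from cache, is tiny next to reading and sending a 50KB image.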
It's still worth considering the usual pros and cons of storing images in a database versus storing paths in the database to files on the filesystem. For example:
Images in the database obey transaction isolation, automatically delete when the row is deleted, etc.
Database with 10GB of images is of course larger than a database storing only pathnames to image files. Backup speed and other factors are relevant.
You need to set MIME headers on the response when you serve an image from a database, through an application.
The images on a filesystem are more easily cached by the web server (e.g. Apache mod_mmap), or could be served by leaner web server like lighttpd. This is actually a pretty big benefit.
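On the MIME header point: when the image comes out of a table, the application has to build the headers that a web server would otherwise derive from the file. A minimal sketch using Python's standard mimetypes module follows; the function name is made up, and it assumes the original file name is stored alongside the blob.

```python
import mimetypes

def image_response_headers(file_name, body):
    """Build the HTTP headers needed to serve a blob read from the database."""
    content_type, _encoding = mimetypes.guess_type(file_name)
    return {
        # Fall back to a generic binary type when the extension is unknown.
        "Content-Type": content_type or "application/octet-stream",
        "Content-Length": str(len(body)),
    }
```

Storing the content type in the table at upload time would avoid guessing from the extension on every request.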
For something like an e-commerce web site, I would be more likely to go with storing the images in a blob store in the database. While you don't want to engage in premature optimization, having my images easily organized alongside my data, and very portable, is an automatic benefit for something like e-commerce.
If the images are indexed then lookup won't be a big problem. I'm not sure but I don't think the lookup for file system is O(1), more like O(n) (I don't think the files are indexed by the file system).
What worries me in this setup is the size of the database, but if managed correctly that won't be a big problem, and a big advantage is that you have only one thing to backup (the database) and not worry about files on disk.
Normally a good solution is to store the images themselves on the filesystem, and the metadata (file name, dimensions, last updated time, anything else you need) in the database.
Having said that, there's no "correct" solution to this.

Resources