Efficient Storage of rows with large fields in Azure - sql-server

Currently putting together a base POC of the architecture for an application that will store large numbers of records, where each record has one field containing a few thousand characters.
e.g.
TableID int
Field1 nvarchar(50)
Field2 nvarchar(50)
Field3 nvarchar(MAX)
This is all being hosted in Azure. We have one webjob that does the work to obtain the data and populate it into the data store and then another webjob comes through periodically and processes the data.
Currently the data is just stored in an Azure SQL Database. I'm just worried that once the record count gets into the many millions, it's going to be incredibly inefficient to store/process/retrieve the data this way.
Advice required on the best way to store this in Azure. I wanted to start by trying an approach where we keep the rows in Azure SQL but push the large field's data into another repository (e.g. Data Lake, DocumentDB) with a reference back to the SQL record, so the SQL calls stay lean and the big data is stored somewhere else. Is this a clean manner of doing it, or am I totally missing something?

Azure Table Storage can help with this solution - it is a NoSQL key-value store, and each entity can be up to 1 MB in size. You could also use individual blobs. There is a design guide that includes a full description of how to design Table Storage solutions for scale, including patterns for using Table Storage alongside other repositories; see the Table Design Guide:
https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/
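To make the hybrid idea from the question concrete, here is a minimal sketch of the lean SQL side, assuming a hypothetical PayloadUri column that points at the blob or Table Storage entity holding the large text:
CREATE TABLE dbo.Records
(
    TableID    int IDENTITY PRIMARY KEY,
    Field1     nvarchar(50),
    Field2     nvarchar(50),
    PayloadUri nvarchar(400) NOT NULL  -- URI of the blob/entity holding the large text
);
Row size then stays small and predictable; the processing webjob dereferences PayloadUri only when it actually needs the full text.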

Related

How to store files effectively in SQL server [duplicate]

It's an old question I know, but with SQL Server 2012 is it finally ok to store files in the database, or should they really be kept in the filesystem with only references to them in the database?
If storing them in the database is considered acceptable these days, what is the most effective way to do it?
I'm planning to apply encryption so I appreciate processing will not be lightning fast.
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo, too, as part of your queries.
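To illustrate that separation, a minimal sketch with hypothetical names - the photo lives in its own table, joined 1:1 on the key:
CREATE TABLE dbo.Employee
(
    EmployeeID int IDENTITY PRIMARY KEY,
    Name       nvarchar(100) NOT NULL
);
CREATE TABLE dbo.EmployeePhoto
(
    EmployeeID int PRIMARY KEY
        REFERENCES dbo.Employee (EmployeeID),  -- 1:1 with Employee
    Photo      varbinary(max) NOT NULL
);
Everyday queries against dbo.Employee never touch the photo pages; you only pay for the blob when you explicitly join to dbo.EmployeePhoto.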
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this filegroup for the large data:
CREATE TABLE dbo.YourTable
(
    ID      int IDENTITY PRIMARY KEY,  -- define your own fields here
    Payload varbinary(max)             -- the large column that lands on LARGE_DATA
)
ON [Data]                  -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON [LARGE_DATA]  -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!
There's still no simple answer. It depends on your scenario. MSDN has documentation to help you decide.
There are other options covered here. Instead of storing in the file system directly or in a BLOB, you can use FILESTREAM or FileTable in SQL Server 2012. The advantages of FileTable seem like a no-brainer (but admittedly I have no personal first-hand experience with them.)
The article is definitely worth a read.
You might read up on FILESTREAM. Here is some info from the docs that should help you decide:
If the following conditions are true, you should consider using FILESTREAM:
Objects that are being stored are, on average, larger than 1 MB.
Fast read access is important.
You are developing applications that use a middle tier for application logic.
For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.
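For reference, a minimal sketch of a FILESTREAM-enabled table; this assumes FILESTREAM is already enabled on the instance and the database has a FILESTREAM filegroup, and the names are hypothetical:
CREATE TABLE dbo.Document
(
    DocumentID uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),  -- FILESTREAM requires a unique ROWGUIDCOL
    Content    varbinary(max) FILESTREAM NULL  -- stored as a file in the FILESTREAM filegroup
);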

Blob Storage of images and text

I have a blob (storage account) that is housed on Azure. I also have a SQL Server table that is housed on Azure. I have a couple of questions:
Is it possible to create a join between the blob and the table?
Is it possible to store all of the information in the blob?
The table has address information on it, and I want to be able to pull that information from the table and associate it or join it to the proper image by the ID in the SQL table (if that is the best way).
Is it possible to create a join between the blob and the table?
No.
Is it possible to store all of the information in the blob?
You possibly could (by storing the address information as blob metadata) but it is not recommended because then you would lose searching capability. Blob storage is simply an object store. You won't be able to query on address information.
The table has address information on it and I wanted to be able to pull that information from the table and associate it or join it to the proper image by the ID in the sql table (if that is the best way)
The recommended way of doing this is to store the images in blob storage. Each blob in blob storage gets a unique URL (https://account.blob.core.windows.net/container/blob.png) that you can store in your database along with the other address fields (e.g. create a column called ImageUrl and store the URL there).
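A minimal sketch of that, with hypothetical table and column names:
-- add the pointer column to the existing address table
ALTER TABLE dbo.AddressTable
    ADD ImageUrl nvarchar(400) NULL;
-- after uploading the image, record its blob URL against the matching row
UPDATE dbo.AddressTable
SET    ImageUrl = 'https://account.blob.core.windows.net/container/blob.png'
WHERE  AddressID = 42;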
Azure Storage (blob, in your case) and SQL Server are completely separate, independent data stores. You cannot do joins, transactions, or really any type of query, across both at the same time.
What you store in each is totally up to you. Typically, people store searchable/indexable metadata within a database engine (such as SQL Server in your case), and non-searchable (binary etc) content in bulk storage (such as blobs).
As far as "best way"? Not sure what you're looking for, but there is no best way. Like I said, some people will store anything searchable in their database. On top of this, they'd store a url to specific blobs that are related to that metadata. There's no specific rule about doing it this way, of course. Whatever works for you and your app...
Note: Blobs have metadata as well, but that metadata is not indexable; it would require searching through all blobs (or all blobs in a container) to perform specific searches.
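That division of labor is what keeps lookups cheap: you search on indexed SQL columns, then dereference the stored URL to fetch the image. A hypothetical read path, reusing the sketch above:
SELECT AddressID, ImageUrl
FROM   dbo.AddressTable
WHERE  City = 'Seattle';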

PostgreSQL - one database for everyone, or one database per customer

I'm working on a web-based business application where each customer will need to have their own data (think basecamphq.com type model). For scalability and ease of upgrades, I'd prefer to have a single database where each customer gets a filtered version of the data. The problem is how to guarantee that they stay sandboxed to their own data. Trying to enforce it in code seems like a disaster waiting to happen. I know Oracle has a way to append a where clause to every query based on a login id, but does PostgreSQL have anything similar?
If not, is there a different design pattern I could use (like creating a view of each table for each customer that filters)?
Worst case scenario, what is the performance/memory overhead of having 1,000 100 MB databases vs. having a single 1 TB database? I will need to provide backup/restore functionality on a per-customer basis, which is dead simple with a single database but quite a bit trickier if customers share the database.
You might want to look into adding Veil to your PostgreSQL installation.
Schemas plus inherited tables might work for this: create your master table, then inherit tables into per-customer schemas that provide a company ID or name field default.
Set the permissions per schema for each customer, and set the schema search path per user. Use the same table names in each schema so that the queries remain the same.
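A minimal PostgreSQL sketch of that pattern, with hypothetical names (a single 'acme' customer shown):
-- master table holding the shared definition
CREATE TABLE public.orders
(
    order_id   bigserial,
    company_id text NOT NULL,
    placed_at  timestamptz DEFAULT now()
);
-- per-customer schema with an inherited table that defaults the company ID
CREATE SCHEMA acme;
CREATE TABLE acme.orders
(
    company_id text NOT NULL DEFAULT 'acme'  -- merged with the inherited column
) INHERITS (public.orders);
-- sandbox the customer's login to their own schema
CREATE ROLE acme_user LOGIN PASSWORD 'secret';
GRANT USAGE ON SCHEMA acme TO acme_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON acme.orders TO acme_user;
ALTER ROLE acme_user SET search_path = acme;
-- acme_user has no grant on public.orders, and unqualified "orders"
-- now resolves to acme.orders, so every customer runs identical queries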

Working with images in WCF

I have a desktop application that needs to upload/download images to/from a service machine over TCP.
At first I stored the images in the file system, but I need to store them in a MS SQL DB as well, to compare which solution is better. The number of images is over half a million. I don't know yet whether there will be any limitation on the size of a photo.
If you have done something like that, please share your opinion on this question.
Which one is faster and safer? Which of them works better with this number of photos? If I store them in the DB, do I need to keep the images apart from all the other tables my application uses? And which type works better in the DB - image or varbinary? ...and so on.
Thank you.
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo, too, as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this filegroup for the large data:
CREATE TABLE dbo.YourTable
(
    ID      int IDENTITY PRIMARY KEY,  -- define your own fields here
    Photo   varbinary(max)             -- the large column that lands on LARGE_DATA
)
ON [Data]                  -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON [LARGE_DATA]  -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!
Which version of SQL Server? Version 2008 adds FILESTREAM, which is specifically designed for this purpose. FILESTREAM data is stored on the NTFS file system, which makes it very fast to access.
If this is not an option, you could look into creating a separate filegroup for your image data (to give you the most flexibility when partitioning your data) and use the varbinary(max) data type (the older image type is deprecated).
A SQL guru will probably chime in with better info.
