I have a blob (storage account) hosted on Azure. I also have a SQL Server table hosted on Azure. I have a couple of questions:
Is it possible to create a join between the blob and the table?
Is it possible to store all of the information in the blob?
The table has address information in it, and I want to be able to pull that information from the table and associate it (or join it) with the proper image by the ID in the SQL table (if that is the best way).
Is it possible to create a join between the blob and the table?
No.
Is it possible to store all of the information in the blob?
You possibly could (by storing the address information as blob metadata) but it is not recommended because then you would lose searching capability. Blob storage is simply an object store. You won't be able to query on address information.
The table has address information in it, and I want to be able to pull that information from the table and associate it (or join it) with the proper image by the ID in the SQL table (if that is the best way).
The recommended way of doing this is to store the images in blob storage. Each blob gets a unique URL (https://account.blob.core.windows.net/container/blob.png) that you can store in your database along with the other address fields (e.g. create a column called ImageUrl and store the URL there).
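For illustration, here is a minimal T-SQL sketch of that pattern; the table, columns, and sample values (Addresses, ImageUrl, and so on) are hypothetical and only the blob URL format comes from above:

-- Hypothetical address table; the ImageUrl column holds the blob's URL.
CREATE TABLE dbo.Addresses
(
    AddressId   INT IDENTITY(1,1) PRIMARY KEY,
    Street      NVARCHAR(100),
    City        NVARCHAR(50),
    PostalCode  NVARCHAR(20),
    ImageUrl    NVARCHAR(400)   -- URL of the blob that holds the image
);

-- After uploading the image to blob storage, save its URL with the address:
INSERT INTO dbo.Addresses (Street, City, PostalCode, ImageUrl)
VALUES (N'1 Example St', N'Sample City', N'00000',
        N'https://account.blob.core.windows.net/container/blob.png');

-- Later, look the image up by the address ID:
SELECT ImageUrl FROM dbo.Addresses WHERE AddressId = 42;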
Azure Storage (blob, in your case) and SQL Server are completely separate, independent data stores. You cannot do joins, transactions, or really any type of query, across both at the same time.
What you store in each is totally up to you. Typically, people store searchable/indexable metadata within a database engine (such as SQL Server in your case), and non-searchable (binary etc) content in bulk storage (such as blobs).
As far as the "best way" goes? I'm not sure what you're looking for, but there is no single best way. Like I said, some people will store anything searchable in their database. On top of this, they'd store a URL to the specific blobs that are related to that metadata. There's no specific rule about doing it this way, of course. Whatever works for you and your app...
Note: Blobs have metadata as well, but that metadata is not indexable; it would require searching through all blobs (or all blobs in a container) to perform specific searches.
Related
I want to store images in a SQL database. The size of each image is between 50 KB and 1 MB. I was reading about FILESTREAM and FILETABLE, but I don't know which to choose. Each row will have 2 images and some other fields.
The images will never be updated or deleted, and about 3,000 rows will be inserted per day.
Which is recommended in this situation?
Originally it was considered a bad idea to store files (i.e. binary data) in a database. The usual workaround is to store the file path in the database and ensure that a file actually exists at that path. It was possible to store files in the database, though, with the varbinary(MAX) data type.
FILESTREAM was introduced in SQL Server 2008 and handles the varbinary column by not storing the data in the database files (only a pointer), but in a separate file on the file system, dramatically improving performance.
FILETABLE was introduced with SQL Server 2012 and is an enhancement over FILESTREAM, because it provides metadata directly to SQL and it allows access to the files outside of SQL (you can browse to the files).
Advice: definitely leverage FILESTREAM, and it might not be a bad idea to use FILETABLE as well.
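As a rough sketch of what a FILESTREAM-backed table looks like, assuming the database already has a FILESTREAM filegroup configured (the table and column names below are illustrative, not from the question):

-- Requires a database with FILESTREAM enabled and a FILESTREAM filegroup.
-- A FILESTREAM column needs a ROWGUIDCOL uniqueidentifier with a UNIQUE constraint.
CREATE TABLE dbo.Images
(
    ImageId   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    FileName  NVARCHAR(260) NOT NULL,
    FileData  VARBINARY(MAX) FILESTREAM NULL   -- stored on the file system, not in the data files
);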
More reading (short): http://www.databasejournal.com/features/mssql/filestream-and-filetable-in-sql-server-2012.html
In SQL Server, BLOBs can be standard varbinary(max) data that stores the data in tables, or FILESTREAM varbinary(max) objects that store the data in the file system. The size and use of the data determines whether you should use database storage or file system storage.
If the following conditions are true, you should consider using FILESTREAM:
Objects that are being stored are, on average, larger than 1 MB.
Fast read access is important.
You are developing applications that use a middle tier for application logic.
For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.
Benefits of FILETABLE (a creation sketch follows this list):
Windows API compatibility for file data stored within a SQL Server database. Windows API compatibility includes the following:
Non-transactional streaming access and in-place updates to FILESTREAM data.
A hierarchical namespace of directories and files.
Storage of file attributes, such as created date and modified date.
Support for Windows file and directory management APIs.
Compatibility with other SQL Server features including management tools, services, and relational query capabilities over FILESTREAM and file attribute data.
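For comparison, a minimal sketch of creating a FILETABLE; it assumes FILESTREAM is enabled and a FILETABLE directory has been configured for the database, and the names are illustrative:

-- Assumes FILESTREAM is enabled and the database has a FILESTREAM filegroup
-- and a FILETABLE directory configured; names are illustrative only.
CREATE TABLE dbo.DocumentStore AS FILETABLE
WITH
(
    FILETABLE_DIRECTORY = 'DocumentStore',
    FILETABLE_COLLATE_FILENAME = database_default
);

-- The files appear under the database's FILESTREAM share and can also be
-- queried like a normal table:
SELECT name, file_type, cached_file_size
FROM dbo.DocumentStore;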
It depends. Personally, I would prefer a link to the image inside the table. It is simpler, and the files in the directory can be backed up separately.
You have to take into account several things:
How you will process the images. Having only a link allows you to easily incorporate images into web pages (with proper configuration of the web server).
How many images there are: if they are stored in the DB and there are a lot of them, this will increase the size of the DB and its backups.
Whether the images change often: in that case it may be better to keep them inside the DB so that the backup reflects their current state.
I have a few questions regarding the process of copying tables from S3 to Snowflake.
The plan is to copy some data from AWS/S3 onto Snowflake and then perform some modeling with DataRobot.
We have some tables that contain PII data, and we would like to hide those columns from DataRobot. What suggestions do you have for this problem?
The schema in AWS needs to match the schema in Snowflake for the copying process.
Thanks,
Mali
Assuming you know the schema of the data you are loading, you have a few options for using Snowflake:
Use COPY INTO statements to load the data into the tables (a sketch follows this list)
Use Snowpipe to auto-load the data into the tables (this would be good for instances where you are regularly loading new data into Snowflake tables)
Use EXTERNAL TABLES to reference the S3 data directly as a table in Snowflake. You'd likely want to use MATERIALIZED VIEWS for this in order for the tables to perform better.
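For the COPY INTO option, here is a rough sketch; the stage, storage integration, file format, table name, and S3 path are hypothetical placeholders rather than details from the question:

-- Hypothetical names throughout; credentials/integration setup omitted.
CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-bucket/path/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');

-- Load the staged files into an existing table.
COPY INTO my_schema.my_table
FROM @my_s3_stage
PATTERN = '.*customers.*[.]csv';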
As for hiding the PII data from DataRobot, I would recommend leveraging Snowflake DYNAMIC DATA MASKING to establish rules that obfuscate the data (or null it out) for the role that DataRobot is using.
All of these features are well-documented in Snowflake documentation:
https://docs.snowflake.com/
Regarding hiding your PII elements, you can use two different roles: one would be, say, data_owner (the role that creates the table and loads the data into it) and another, say, data_modelling (the role DataRobot uses).
Create masking policies using the data_owner role such that the DataRobot role cannot see the column data.
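A minimal sketch of such a masking policy, with hypothetical policy, role, table, and column names (note that masking policies require Snowflake Enterprise Edition or higher):

-- Hypothetical role, policy, table, and column names.
CREATE MASKING POLICY pii_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_OWNER') THEN val
    ELSE '***MASKED***'   -- what the data_modelling / DataRobot role sees
  END;

ALTER TABLE customers MODIFY COLUMN email
  SET MASKING POLICY pii_mask;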
About your question on copying the data: there is no requirement that the AWS S3 folder structure be in sync with Snowflake. You can create the external stage with any name and point it at any S3 folder.
The Snowflake documentation has a good example which helps you get some hands-on experience:
https://docs.snowflake.com/en/user-guide/data-load-s3.html
I need to build a chat application. I also need to store all chat transcripts in some storage: a SQL database, Table storage, or another Azure mechanism.
It should store about 50 million characters per day. Each block of text should be attached to a specific customer.
My question:
What is the best way to store such amount of text in Azure?
Thanks.
I would store them in Azure Tables, using conversationId as the partition key and messageId as the row key. That way, you can easily aggregate your statistics based on those two and quickly retrieve conversations.
[Background]
I am currently creating a WCF service for storing and retrieving articles of our university.
I need to save files and the metadata of these files.
My WCF service needs to serve 1,000 people a day.
The storage will contain about 60,000 articles.
I have three different ways to do it.
I can save the metadata (file name, file type) in SQL Server (to create a unique ID) and save the files into Azure Blob storage.
I can save the metadata and the data in SQL Server.
I can save the metadata and the data in Azure Blob storage.
Which way would you choose, and why?
If you suggest your own solution, that would be wonderful.
P.S. Both of them (SQL Server and Blob storage) are hosted on Azure.
I would recommend going with option 1: save the metadata in the database but save the files in blob storage. Here are my reasons:
Blob storage is meant for exactly this purpose. As of today an account can hold 500 TB of data and each blob can be up to 200 GB in size, so space is not a limitation.
Compared to SQL Server, it is extremely cheap to store data in blob storage.
The reason I recommend storing metadata in the database is that blob storage is a simple object store without any querying capabilities. So if you want to search for files, you can query your database to find them and then return the file URLs to your users.
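As a small illustration of that split (the table and column names are hypothetical), the database holds only the searchable metadata plus the blob URL, following the same pattern as the ImageUrl column mentioned earlier:

-- Hypothetical schema: searchable metadata in SQL Server, the file itself in blob storage.
CREATE TABLE dbo.ArticleFiles
(
    ArticleId  INT IDENTITY(1,1) PRIMARY KEY,
    FileName   NVARCHAR(260) NOT NULL,
    FileType   NVARCHAR(50)  NOT NULL,
    BlobUrl    NVARCHAR(400) NOT NULL   -- URL of the blob holding the file
);

-- Searching is done against the metadata; only the URL is returned to the client.
SELECT FileName, BlobUrl
FROM dbo.ArticleFiles
WHERE FileName LIKE N'%thesis%';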
However, please keep in mind that because these (the database server and blob storage) are two distinct data stores, you won't be able to achieve transactional consistency. When creating files, I would recommend uploading the file to blob storage first and then creating a record in the database. Likewise, when deleting files, I would recommend deleting the record from the database first and then removing the blob. If you're concerned about having orphaned blobs (i.e. blobs without a matching record in the database), I would recommend running a background task which finds the orphaned blobs and deletes them.
I am currently putting together a base POC of the architecture for an application that will store large numbers of records, where each record has one field containing a few thousand characters.
e.g.
TableID int
Field1 nvarchar(50)
Field2 nvarchar(50)
Field3 nvarchar(MAX)
This is all being hosted in Azure. We have one webjob that does the work to obtain the data and populate it into the data store and then another webjob comes through periodically and processes the data.
Currently the data is just stored in an Azure SQL database. I'm worried that once the record count reaches many millions, it's going to be incredibly inefficient to store/process/retrieve the data this way.
Advice is required on the best way to store this in Azure. I wanted to start by trying an approach where we keep the rows in Azure SQL but push the large field's data into another repository (e.g. Data Lake, DocumentDB) with a reference back to the SQL record, so the SQL calls stay lean and the big data is stored somewhere else. Is this a clean manner of doing it, or am I totally missing something?
Azure Table Storage can help with this solution; it is a NoSQL key-value store. Each entity can be up to 1 MB in size. You could also use individual blobs. There is a design guide that includes a full description of how to design Table Storage solutions for scale, including patterns for using Table Storage along with other repositories; see the Table Design Guide:
https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/