Best approach for storing images - database

I am working on a .NET web application. The application has a separate database (SQL Server 2012) for storing and retrieving images. We have functionality in the application where multiple scanned documents are attached and stored in the database against each process. The daily increase in the size of this document DB is about 9 GB. I am stuck with handling this huge document DB, and I am looking for answers to the following questions:
Should we use a SQL Server database, MongoDB, or any other database for this scenario?
What should we do to better store these daily increasing images? Should we use a partitioning technique, an archiving technique, or some file system management technique?
What are the best practices for managing such huge image databases?
How do we control its size; should we apply a compression technique?
The system is mainly used for inserting documents, and these documents are accessed rarely.
How can we optimize its storage?
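Since the question asks specifically about partitioning and compression, here is a minimal T-SQL sketch purely as an illustration of those techniques; all names and boundary values are hypothetical, and it assumes an edition of SQL Server that supports table partitioning.

```sql
-- Hypothetical monthly partitioning of a scanned-document table; old
-- partitions can later be switched out to an archive table or filegroup.
CREATE PARTITION FUNCTION pf_DocsByMonth (date)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

CREATE PARTITION SCHEME ps_DocsByMonth
    AS PARTITION pf_DocsByMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.ScannedDocuments
(
    DocumentId   BIGINT IDENTITY NOT NULL,
    ProcessId    INT             NOT NULL,
    CreatedDate  date            NOT NULL,
    DocumentData VARBINARY(MAX)  NOT NULL,
    CONSTRAINT PK_ScannedDocuments
        PRIMARY KEY CLUSTERED (DocumentId, CreatedDate)
) ON ps_DocsByMonth (CreatedDate);

-- Note: ROW/PAGE compression only applies to in-row data, so the document
-- bytes themselves would normally be compressed in the application (or with
-- COMPRESS() on SQL Server 2016+) before being inserted.
```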

Related

Loading data from SQL Server to Elasticsearch

Looking for suggestions on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs, and as a result there is a 24-hour data latency.
We are looking to build something that would allow for more real-time availability of the data, similar to SSRS, for our clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data during the weekend and then apply writes in real time during weekdays.
Thanks.
Elasticsearch's main use cases are providing search-type capabilities on top of large volumes of unstructured, text-based data. For example, if you were ingesting large batches of emails into your data store every day, Elasticsearch is a good tool for parsing out pieces of those emails based on rules you set up, to enable searching (and to some degree querying) of those email messages.
If your data is already in SQL Server, it sounds like it's structured already and therefore there's not much gained from ElasticSearch in terms of reportability and availability. Rather you'd likely be introducing extra complexity to your data workflow.
If you have structured data in SQL Server already and you are experiencing issues with reporting directly off of it, you should look at building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features you could look into are AlwaysOn Availability Groups, Replication, and SSIS.
Each of the options above (in addition to other out-of-the-box features of SQL Server) has different pros and cons. For example, AlwaysOn Availability Groups are very easy to set up and offer the ability to fail over automatically if your main server has an outage, but they clone the entire database to a replica. Replication lets you choose more granularly to copy only specific tables and views, but then you can't fail over as easily if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
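As an illustration of that kind of change, here is a minimal sketch of adding a columnstore index to support reporting queries; the table and columns are hypothetical, and it assumes SQL Server 2016 or later, where nonclustered columnstore indexes are updatable.

```sql
-- A nonclustered columnstore index leaves the rowstore table as-is for OLTP
-- work while dramatically speeding up scans and aggregations for reporting.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Sales_Reporting
ON dbo.Sales (OrderDate, CustomerId, ProductId, Quantity, Amount);

-- A typical reporting aggregate that can use the columnstore index:
SELECT ProductId, SUM(Amount) AS TotalAmount
FROM dbo.Sales
WHERE OrderDate >= '2024-01-01'
GROUP BY ProductId;
```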
I've been down both pathways, implementing Elasticsearch and implementing a data warehouse using all three of the main data synchronization features above, for structured data and for unstructured large text data, and have experienced the proper use cases for both. One data warehouse I've managed in the past had tables with billions of rows in them (each table terabytes big), and it was highly performant for reporting on fairly modest hardware in AWS (we weren't even using Redshift).

Can we use Snowflake as the database for a data-driven web application?

I am an ASP.NET MVC/SQL Server developer and I am very new to all this, so I may be on a completely wrong path.
I learned by googling that Snowflake can put/get data from AWS S3, Google Cloud Storage, and Azure, and that Snowflake has its own databases and tables as well.
I have the following questions:
Why should one use Snowflake when you can process your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
Can we use Snowflake as the database for a data-driven web application? And if yes, could you provide a link or something to start with?
Once again, I am very new to all this and hoping to get ideas and the best way to work around this.
Thank you in advance.
Why should one use Snowflake when you can process your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform; like other database technologies, it provides data storage, metadata, and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself (see the sketch after this list).
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
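To make those two approaches concrete, here is a minimal Snowflake SQL sketch; the stage, bucket path, file format, and table names are all hypothetical, and a real S3 stage would normally also need a storage integration or credentials.

```sql
-- Hypothetical external stage over files sitting in cloud storage (S3 here).
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/events/'
  FILE_FORMAT = (TYPE = PARQUET);

-- Option 1: query the files in place through an external table.
CREATE EXTERNAL TABLE events_ext
  LOCATION = @my_s3_stage
  AUTO_REFRESH = FALSE
  FILE_FORMAT = (TYPE = PARQUET);

-- Option 2: load the files into a native Snowflake table with COPY INTO.
CREATE TABLE events (event_id NUMBER, event_time TIMESTAMP_NTZ, payload VARIANT);

COPY INTO events
  FROM @my_s3_stage
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```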
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports SQL queries; as mentioned above, there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database, though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as the database for a data-driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It is not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example, Snowflake doesn't enforce referential integrity constraints.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said. To add to that, in my experience every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem of using MPP/Distributed Databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use MPP/Distributed Databases.
Snowflake and other MPP/distributed databases are NOT meant for OLTP workloads but for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs, the more money they make. To make money they have to optimize performance, but not too much, otherwise it will affect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.
Enforcing referential integrity is a double-edged sword; the downside is that as the data volume grows, the referential-violation check significantly slows down inserts and deletes. This often results in the developer putting the RI check in the application (with a dirty read) and turning off RI enforcement in the database, ending up in a Snowflake-like situation anyway.
The bottom line is that Snowflake not enforcing RI should not be a limitation for OLTP applications.
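To illustrate the point about unenforced referential integrity, here is a minimal Snowflake SQL sketch with hypothetical tables: Snowflake accepts primary and foreign key declarations, but apart from NOT NULL it does not enforce them.

```sql
-- PK/FK constraints in Snowflake are metadata only (NOT NULL is enforced).
CREATE TABLE customers (customer_id NUMBER PRIMARY KEY, name STRING);

CREATE TABLE orders (
    order_id    NUMBER PRIMARY KEY,
    customer_id NUMBER REFERENCES customers (customer_id),
    amount      NUMBER(10, 2)
);

-- This insert succeeds even though customer 999 does not exist.
INSERT INTO orders (order_id, customer_id, amount) VALUES (1, 999, 42.00);
```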

SQL Server vs. No-SQL Database

I have inherited a legacy content delivery system and I need to re-design & re-build it. The content is delivered by content suppliers (e.g. Sony Music) and is ingested by a legacy .NET app into a SQL Server database.
Each content has some common properties (e.g. Title & Artist Name) as well as some content-type specific properties (e.g. Bit Rate for MP3 files and Frame Rate for video files).
This information is stored in a relational database in multiple tables. These tables might have null values in some of their fields because a given field might not apply to that type of content. The database is constantly under write operations because the content ingestion system is constantly receiving content files from the suppliers and then adds their metadata to the database.
Also, there is a public-facing web application which lets end users buy the ingested content (e.g. music, videos, etc). This web application relies entirely on an Elasticsearch index; in fact, the application does not see the database at all and uses the Elasticsearch index as its source of data. The reason is that SQL Server does not perform as fast or as efficiently as Elasticsearch when it comes to text search.
To keep the database and Elasticsearch in sync there is a Windows service which reads the updates from SQL Sever and writes them to the Elasticsearch index!
As you can see there are a few problems here:
The data is saved in a relational database, which makes the data hard to manage. For example, there is a table of 3 billion records storing the metadata of each content item as key-value pairs! To me, using a NoSQL database or index would make a lot more sense, as they allow documents with different structures to be stored together.
The Elasticsearch index needs to be kept in sync with the database. If the Windows service stops working for any reason, the index does not get updated. Also, when there are too many inserts/updates in the database, it takes a while for the index to get updated.
We need to maintain two sources of data, which has a cost overhead.
Now my question: is there a NoSQL database which has these characteristics?
Allows me to store documents with different structures in it?
Provides good text-search functions and performance? e.g. Fuzzy search etc.
Allows multiple updates to be made to its data concurrently? Based on my experience Elasticsearch has problems with concurrent updates.
It can be installed and used on Amazon AWS infrastructure because our new products will be hosted on AWS. Auto-scaling and clustering are important, e.g. DynamoDB.
It should have some kind of GUI so that support staff or developers can modify the data to some extent.
A combination of DynamoDB and ElasticSearch may work for your use case.
DynamoDB certainly supports characteristics 1, 3, 4, and 5.
There is now a Logstash Input Plugin for DynamoDB that can be combined with an ElasticSearch output plugin to keep your table and index in sync in real time. ElasticSearch provides characteristic 2.

Is it possible to map filegroups in SQL Server 2012 to Azure Blob Storage?

Currently our MS SQL Server 2012 (on-premises) uses FILESTREAM to store files. The filestreams are stored in separate filegroups. But as the number of files increases, we are afraid it will affect the overall performance of the database. So we are planning to move these files to Azure Blob Storage.
I found an article in MSDN about keeping data files in azure storage: https://msdn.microsoft.com/en-us/library/dn466438.aspx
But I believe this is for SQL Server 2014.
So, is there a similar approach for 2012?
Or am I left with uploading the files to Azure from the application and keeping the paths in the database? Uploading files from the application is a very big rework, and maintaining ACID properties becomes questionable.
As you know, FileStream guarantees transactional consistency between relational tables in SQL Server and BLOBs stored outside of SQL Server, in the file system.
SQL just stores references to the BLOBs to handle DML operations across them.
This keeps their overhead small. Maintenance operations (e.g. Backup/Restore, index rebuilds, etc) apply only to relational tables (no BLOBs).
Because of this, deployments using FileStream can scale to TBs of BLOBs without significantly impacting SQL Server performance. To increase I/O throughput add multiple files, striped across different disks, to the FileStream filegroups.
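For reference, here is a minimal T-SQL sketch of adding a FILESTREAM filegroup with multiple containers striped across different disks, as suggested above; the database name, filegroup name, and paths are hypothetical.

```sql
-- Each FILESTREAM "file" is actually a directory (container); putting the
-- containers on different physical disks spreads the BLOB I/O load.
ALTER DATABASE DocumentStore
    ADD FILEGROUP FS_Documents CONTAINS FILESTREAM;

ALTER DATABASE DocumentStore
    ADD FILE (NAME = fs_docs_1, FILENAME = 'D:\FileStreamData\Docs1')
    TO FILEGROUP FS_Documents;

ALTER DATABASE DocumentStore
    ADD FILE (NAME = fs_docs_2, FILENAME = 'E:\FileStreamData\Docs2')
    TO FILEGROUP FS_Documents;
```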
Other recommendations at https://technet.microsoft.com/en-us/library/dd206979(v=sql.105).aspx
SQL Server Data Files in Azure (introduced in SQL Server 2014) doesn't support FileStream/FileTable (the underlying driver doesn't recognize the Azure BLOB path). In any case, I would expect doing this to hurt transactional performance due to the high network latency compared to local storage.
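For completeness, this is roughly what the SQL Server 2014+ "data files in Azure" setup from the MSDN article looks like; it is not available in 2012, and the storage account, container URL, and SAS token below are placeholders.

```sql
-- Requires SQL Server 2014 or later. The credential name must match the
-- container URL, and the secret is a shared access signature for it.
CREATE CREDENTIAL [https://myaccount.blob.core.windows.net/data]
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
    SECRET = '<sas-token-placeholder>';

CREATE DATABASE AzureBackedDb
ON (NAME = azuredb_data,
    FILENAME = 'https://myaccount.blob.core.windows.net/data/azuredb_data.mdf')
LOG ON (NAME = azuredb_log,
    FILENAME = 'https://myaccount.blob.core.windows.net/data/azuredb_log.ldf');
```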

Using SQL Server as Image store

Is SQL Server 2008 a good option to use as an image store for an e-commerce website? It would be used to store product images of various sizes and angles. A web server would serve those images, reading the table by a clustered ID. The total image size would be around 10 GB, but it will need to scale. I see a lot of benefits over using the file system, but I am worried that SQL Server, not having an O(1) lookup, is not the best solution, given that the site has a lot of traffic. Would that even be a bottleneck? What are some thoughts, or perhaps other options?
10 GB is not a huge amount of data, so you can probably use the database to store it without big issues; of course it's best performance-wise to use the filesystem, while safety- and management-wise it's better to use the DB (backups and consistency).
Happily, SQL Server 2008 allows you to have your cake and eat it too, with:
The FILESTREAM Attribute
In SQL Server 2008, you can apply the FILESTREAM attribute to a varbinary column, and SQL Server then stores the data for that column on the local NTFS file system. Storing the data on the file system brings two key benefits:
Performance matches the streaming performance of the file system.
BLOB size is limited only by the file system volume size.
However, the column can be managed just like any other BLOB column in SQL Server, so administrators can use the manageability and security capabilities of SQL Server to integrate BLOB data management with the rest of the data in the relational database—without needing to manage the file system data separately.
Defining the data as a FILESTREAM column in SQL Server also ensures data-level consistency between the relational data in the database and the unstructured data that is physically stored on the file system. A FILESTREAM column behaves exactly the same as a BLOB column, which means full integration of maintenance operations such as backup and restore, complete integration with the SQL Server security model, and full-transaction support.
Application developers can work with FILESTREAM data through one of two programming models; they can use Transact-SQL to access and manipulate the data just like standard BLOB columns, or they can use the Win32 streaming APIs with Transact-SQL transactional semantics to ensure consistency, which means that they can use standard Win32 read/write calls to FILESTREAM BLOBs as they would if interacting with files on the file system.
In SQL Server 2008, FILESTREAM columns can only store data on local disk volumes, and some features such as transparent encryption and table-valued parameters are not supported for FILESTREAM columns. Additionally, you cannot use tables that contain FILESTREAM columns in database snapshots or database mirroring sessions, although log shipping is supported.
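Here is a minimal sketch of what a FILESTREAM column as described above looks like in T-SQL; the table is hypothetical, and it assumes FILESTREAM is enabled on the instance and the database already has a FILESTREAM filegroup.

```sql
-- A FILESTREAM table needs a UNIQUEIDENTIFIER ROWGUIDCOL column; the
-- varbinary(max) FILESTREAM data is stored on the NTFS file system.
CREATE TABLE dbo.ProductImages
(
    ImageId   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL
              DEFAULT NEWSEQUENTIALID() UNIQUE,
    ProductId INT            NOT NULL,
    MimeType  NVARCHAR(100)  NOT NULL,
    ImageData VARBINARY(MAX) FILESTREAM NULL
);

-- Inserts and queries work just like any other BLOB column.
INSERT INTO dbo.ProductImages (ProductId, MimeType, ImageData)
VALUES (1, N'image/jpeg', 0xFFD8FFE0 /* placeholder image bytes */);
```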
Check out this white paper from MS Research (http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2006-45)
They detail exactly what you're looking for. The short version is that any file size over 1 MB starts to degrade performance compared to saving the data on the file system.
I doubt that O(log n) for lookups would be a problem. You say you have 10GB of images. Assuming an average image size of say 50KB, that's 200,000 images. Doing an indexed lookup in a table for 200K rows is not a problem. It would be small compared to the time needed to actually read the image from disk and transfer it through your app and to the client.
It's still worth considering the usual pros and cons of storing images in a database versus storing paths in the database to files on the filesystem. For example:
Images in the database obey transaction isolation, automatically delete when the row is deleted, etc.
Database with 10GB of images is of course larger than a database storing only pathnames to image files. Backup speed and other factors are relevant.
You need to set MIME headers on the response when you serve an image from a database, through an application.
The images on a filesystem are more easily cached by the web server (e.g. Apache mod_mmap), or could be served by a leaner web server like lighttpd. This is actually a pretty big benefit.
For something like an e-commerce web site, I would be more likely to go with storing the images in a blob store in the database. While you don't want to engage in premature optimization, having my images easily organized alongside my data, as well as very portable, is one automatic benefit for something like e-commerce.
If the images are indexed then lookup won't be a big problem. I'm not sure, but I don't think a file system lookup is O(1); it's more like O(n) (I don't think files are indexed by the file system).
What worries me in this setup is the size of the database, but if managed correctly that won't be a big problem, and a big advantage is that you have only one thing to back up (the database) rather than also worrying about files on disk.
Normally a good solution is to store the images themselves on the filesystem, and the metadata (file name, dimensions, last updated time, anything else you need) in the database.
Having said that, there's no "correct" solution to this.
