I have been asked to move an Azure Table Storage service from the Microsoft Dublin Data Centre to a data centre in the UK (any secure data centre so long as it is UK based). My problem is that Azure table storage is only supported in Microsoft data centres (Dublin and Amsterdam in Europe).
There are about 50 tables plus queues and blobs. The data requirements are for highly scalable storage. I do not want to re-write the storage into SQL Server because this will require schema management and direct management of the indexes for performance. I would prefer a performant nosql database that operates to the standards of Azure table storage.
Does anyone have any experience in this area?
As far as migrating your data, there's no automated way to do it. Check out the AzCopy utility for blobs.
As far as which database to choose, that's really going to be app-dependent. You'll need to think about search, storage, indexing, map/reduce/aggregation, etc. Then there's document, column, graph, key/value, sql - you may choose one or a combination.
Just remember that Table Storage is storage-as-a-service, with triple-replicated storage providing durability, and upwards of 2000 tps per partition, 20K tps per storage account. You'll need to build this out yourself, whatever you choose (maybe Redis?).
Anyway: This is pretty wide-open as far as solving your architecture. But hopefully I gave you something to think about.
One more thing: You should really look into the reasons for moving your data. Many people talk about data sovereignty, but sometimes it turns out that the data location doesn't violate any local data laws (or that some data can actually remain where it is, with only specific data needing to be hosted within a country's boundaries).
Related
I am Asp.Net MVC/SQLSERVER developer and I am very new to all these and so I may be on compelete wrong path.
I came to know by googling that Snowwflake can put/get data from AWS-S3, Google Storage and Azure. And Snowflake has their database and tables as well.
I have following questions,
Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
Can we use Snowflake as database for data driven web application? and if yes, could you provide link or something to start?
Once again I am very new to all these and expecting from you to get ideas and best way to work arround this.
Thak you in advance.
Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform, similar to other database technologies it provides data storage and metadata and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports the use of SQL queries, as mentioned above there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as database for data driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example Snowflake doesn't support features like referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said, to add to that, from my experience every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem of using MPP/Distributed Databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use MPP/Distributed Databases.
Snowflake and other MPP/Distributed Databases are NOT meant for OLTP workloads but instead for OLAP workloads. No matter what snake oil those companies like databricks and snowflake try to sell you MPP/Distributed databases are NOT meant for OLTP. The costs alone would be tremendous even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms because the longer a query runs the more money they make. For them to make money they have to optimize performance but not too much otherwise it will effect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.
Enforcing Referential integrity is a double edged sword, the downside being as the data volume grows the referential violation check significantly slows down the inserts and deletes. This results in the developer having to put the RI check in the program (with a dirty read) and turn off the RI enforcement by the database, finally ending up with a Snowflake like situation.
Bottom line is Snowflake not enforcing RI should not be a limitation for OLTP applications.
I'm using a Cloud SQL instance to store two types of data: typical transactional data and large "read-only" data. Each of these read-only tables could have GBs of data and they work like snapshots that are refreshed once a day. The old data is totally replaced by the most recent data. The "read-only" tables reference data from the "transactional tables", but I don't necessarily need to perform joins between them, so they're kind of "independent".
In this context, I believe using Cloud SQL to store these kind of tables are going to be a problem in terms of billing. Because Cloud SQL is fully managed, I would be paying for maintenance work from Google and I wouldn't need any kind of maintenance for those specific tables.
Maybe there are databases more suitable for storing snapshot/temporary data. I'm considering to move those type of tables to another kind of storage, but it's possible that I would end up making the bill even higher. Or maybe I could continue using Cloud SQL for those tables and just unlog them.
Can anyone help me with this? Is there any kind of storage in GCP that would be great for storing large snapshots that are refreshed once a day? Or is there an workaround to make Cloud SQL not maintain those tables?
This is a tough question because there are a lot of options and a lot of things that could work. The GCP documentation page "Choosing a Storage Option" is very handy in this kind of cases. It has a flowchart to select a storage option based on the kind of data you want to store, a video that explains each storage option and a table with the description, strong points and use cases for each option. I would recommend to start there.
Also, if the issue with Cloud SQL is that is fully managed and pricy, you can set up MySQL on Google Compute Engine and manage it yourself. Is also fairly cheaper for the same machine. For a n1-standard-1, $0.0965 in Cloud SQL and $0.0475 in GCE (keep in mind that other charges may apply on top of the machine price)
I have a legacy project that I like to migrate piece by piece. Now, the data is tightly connected.
For example:
Flight information
Crew info
Passengers info
Airports
Flights can have many airports too
I also would like to have the database that is scaleable and flexible in changing structure. I’m using AWS.
I had a look at DynamoDB, it has the flexibility that I’m looking for but when I found it’s hard to query a particular single item in 1-to-many for example.
I also know how inflexible it is to change structure or schema in RDS.
Any suggestions?
The decision is whether to use a NoSQL database or a Relational Database that uses SQL.
In situations where you will be querying information across many types of data (eg "How many flights from LAX with more than 100 passengers departed more than 15 minutes late"), a relational database makes things much easier.
They can be slower, but queries are much easier to write. (For example, the above question could be done in one query.)
A NoSQL database is excellent when you need predictable performance (reads per second), but cannot join information across tables.
Once you have picked your database type, then you can choose the database engine. Amazon Relational Database Service (RDS) offers:
MySQL (Open-source, low cost)
Aurora (Cloud version of MySQL, much more performant)
MariaDB
Microsoft SQL Server (higher cost)
Oracle (higher cost)
If in doubt, go with Aurora since it is more cloud-native and is fully compatible with MySQL. There is now a serverless version of Aurora that can automatically scale and even turn off when unused to lower costs.
As to flexibility of schemas, all SQL databases basically offer the same flexibility via ALTER TABLE.
Can an enterprise use both data warehouse and database in one Head Office? Is it just OK to use only one of these or is it necessary to use both in the same place?
Yes, an enterprise can use both data warehouse and database in one office. They need not be in the same physical data-center. It all depends on the needs of the organization. Generally, databases are used to support transactions as they happen and data warehouses are used to support business intelligence or the like.
Database
Transactions in an enterprise is most likely to happen in a relational database management system (aka database, aka RDBMS). Reporting can happen using the same database, but it is also possible that reporting is done off of a mirror of the RDBMS. Now, an enterprise may have more than one RDBMS - one running SQL Server, one running Oracle, one running MySQL etc. All this is great for recording activities and reporting.
Warehouse
Additionally, enterprises seek to do data analysis on a regular basis. Business Intelligence, data science, big data - regardless of the term, we are talking about data analysis overall. Doing number crunching on large amounts of data stored in an RDBMS can be hard on the RDBMS. So, organizations decide to de-normalize data to some extent and store data in a warehouse. When data is extracted, transformed and loaded (ETL) from one or more RDBMS (and other sources of data) and stored in a data warehouse, it is available for some research.
Organizations may choose to move the warehouse to a different office location, or may have multiple-warehouses. For example, a headquarter with 5 satellite facilities may choose to bring data from all those facilities to the warehouse at the headquarter every night, or it may choose to have a warehouse in a different datacenter. In contrast to that, a company with hundreds of satellite facilities may choose to have a warehouse with high-level summarized data at their headquarter and regionalize their warehouses; one warehouse in each continent, so that target markets are better served by satellite units in that particular continent.
Database (or databases) to Warehouse journey
Business Intelligence
Cognos, QlikView, Tableau, Microstrategy etc. are some business intelligence/data analytic tools among many that reach out to the data warehouse and present data for analytics. They are great for presentation and reporting - data visualization, in general. These tools can also get data from RDBMS, but it's convenient to get it from data warehouse since they are architected in a way to make it easier to showcase that data on a business intelligence dashboard
Example of a dashboard:
Big data
The buzzword around big data is interesting. Many of us may take a subset of data from a large pool of data, do analysis and assume that the results from the subset applies to the large pool of data. What if all the data was used for analysis? And even better - what if we took related data from elsewhere (outside of our dataset) and included it in our analysis? Yeah, you would have a giant pile of data and if you had means to analyze them all, you'd be doing big data. We are talking several hundred GB or even PB of data. Although Hadoop and the like are used in big data analysis, they could derive that data from the warehouse.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Storing Images in DB - Yea or Nay?
For ages I've been told not to store images on the database, or any big BLOB for that matter. While I can understand why the databases aren't/weren't efficient for that I never understood why they couldn't. If I can put a file somewhere and reference it, why couldn't the database engine do the same. I'm glad Damien Katz mentioned it on a recent Stack Overflow podcast and Joel Spolsky and Jeff Atwood, at least silently, agreed.
I've been reading hints that Microsoft SQL Server 2008 should be able to handle BLOBs efficient, is that true? If so, what is there stopping us from just storing images there and getting rid of one problem? One thing I can think of is that while the image can be served by a static web server very quickly if it's a file somewhere, when it's in the database it has to travel from the database to the web server application (which might be slower than the static web server) and then it's served. Shouldn't caching help/solve that last issue?
Yes, it's true, SQL Server 2008 just implemented a feature like the one you mention, it's called a filestream. And it's a good argument indeed for storing blobs in a DB, if you are certain you will only want to use SQL Server for your app (or are willing to pay the price in either performance or in developing a similar layer on top of the new DB server). Although I expect similar layers will start to appear if they don't already exist for different DB servers.
As always what would the real benefits be depend on the particular scenario. If you will serve lots of relatively static, big files, then this scenario plus caching will probably be the best option considering a performance/manageability combo.
This white paper describes the FILESTREAM feature of SQL Server 2008, which allows storage of and efficient access to BLOB data using a combination of SQL Server 2008 and the NTFS file system. It covers choices for BLOB storage, configuring Windows and SQL Server for using FILESTREAM data, considerations for combining FILESTREAM with other features, and implementation details such as partitioning and performance.
Just because you can do something doesn't mean you should.
If you care about efficiency you'll still most likely not want to do this for any sufficiently large scale file serving.
Also it looks like this topic has been heavily discussed...
Exact Duplicate: User Images: Database or filesystem storage?
Exact Duplicate: Storing images in database: Yea or nay?
Exact Duplicate: Should I store my images in the database or folders?
Exact Duplicate: Would you store binary data in database or folders?
Exact Duplicate: Store pictures as files or or the database for a web app?
Exact Duplicate: Storing a small number of images: blob or fs?
Exact Duplicate: store image in filesystem or database?
I'll try to decompose your question and address your various parts as best I can.
SQL Server 2008 and the Filestream Type - Vinko's answer above is the best one I've seen so far. The Filestream type is the SQL Server 2008 is what you were looking for. Filestream is in version 1 so there are still some reasons why I wouldn't recommend using if for an enterprise application. As an example, my recollection is that you can't split the storage of the underlying physical files across multiple Windows UNC paths. Sooner or later that will become a pretty serious constraint for an enterprise app.
Storing Files in the Database - In the grander scheme of things, Damien Katz's original direction was correct. Most of the big enterprise content management (ECM) players store files on the filesystem and metadata in the RDBMS. If you go even bigger and look at Amazon's S3 service, you're looking at physical files with a non-relational database backend. Unless you're measuring your files under storage in the billions, I wouldn't recommend going this route and rolling your own.
A Bit More Detail on Files in the Database - At first glance, a lot of things speak for files in the database. One is simplicity, two is transactional integrity. Since the Windows file system cannot be enlisted in a transaction, writes that need to occur across the database and filesystem need to have transaction compensation logic built in. I didn't really see the other side of the story until I talked to DBAs. They generally don't like commingling business data and blobs (backup becomes painful) so unless you have a separate database dedicated to file storage, this option is generally not as appealing to DBAs. You're right that the database will be faster, all other things being equal. Not knowing the use case for your application, I can't say much about the caching option. Suffice it to say that in many enterprise applications, the cache hit rate on documents is just too darn low to justify caching them.
Hope this helps.
One of the classical reasons for caution about storing blobs in databases is that the data will be stored and edited (changed) under transaction control, which means that the DBMS needs to ensure that it can rollback changes, and recover changes after a crash. This is normally done by some variation on the theme of a transaction log. If the DBMS is to record the change in a 2 GB blob, then it has to have a way of identifying what has changed. This might be simple-minded (the before image and the after image) or more sophisticated (some sort of binary delta operation) that is more computationally expensive. Even so, sometimes the net result will be gigabytes of data to be stored through the logs. This hurts the system performance. There are various ways of limiting the impact of the changes - reducing the amount of data flowing through the logs - but there are trade-offs.
The penalty for storing filenames in the database is that the DBMS has no control (in general) over when the files change - and hence again, the reproducibility of the data is compromised; you cannot guarantee that something outside the DBMS has not changed the data. (There's a very general version of that argument - you can't be sure that someone hasn't tampered with the database storage files in general. But I'm referring to storing a file name in the database referencing a file not controlled by the DBMS. Files controlled by the DBMS are protected against casual change by the unprivileged.)
The new SQL Server functionality sounds interesting. I've not explored what it does, so I can't comment on the extent to which it avoids or limits the problems alluded to above.
There are options within SQL Server to manage where it stores large blobs of data, these have been in there since at lease SQL2005 so I don't know why you couldn't store large BLOBs of data. MOSS for instance stores all of the documents you upload to it in a SQL database.
There are of course some performance implications, as with just about anything, so you should take care that you don't retreive the blob if you don't need it, and don't include it in indexes etc.