Which AWS database should I use?

I have a legacy project that I'd like to migrate piece by piece. The data is tightly connected.
For example:
Flight information
Crew info
Passengers info
Airports
Flights can have many airports too
I'd also like a database that is scalable and flexible when it comes to changing the structure. I'm using AWS.
I had a look at DynamoDB; it has the flexibility I'm looking for, but I found it's hard to query a particular single item in a 1-to-many relationship, for example.
I also know how inflexible it is to change structure or schema in RDS.
Any suggestions?

The decision is whether to use a NoSQL database or a Relational Database that uses SQL.
In situations where you will be querying information across many types of data (e.g. "How many flights from LAX with more than 100 passengers departed more than 15 minutes late?"), a relational database makes things much easier.
They can be slower, but queries are much easier to write. (For example, the above question could be done in one query.)
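As a rough sketch, and assuming hypothetical table and column names (flights, flight_passengers, and scheduled/actual departure columns), that question might look something like this in MySQL-flavoured SQL:

    -- Flights departing LAX more than 15 minutes late with more than 100 passengers
    -- (table and column names are illustrative).
    SELECT f.flight_id, COUNT(p.passenger_id) AS passenger_count
    FROM flights f
    JOIN flight_passengers p ON p.flight_id = f.flight_id
    WHERE f.origin_airport = 'LAX'
      AND f.actual_departure > f.scheduled_departure + INTERVAL 15 MINUTE
    GROUP BY f.flight_id
    HAVING COUNT(p.passenger_id) > 100;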
A NoSQL database is excellent when you need predictable performance (reads per second), but cannot join information across tables.
Once you have picked your database type, then you can choose the database engine. Amazon Relational Database Service (RDS) offers:
MySQL (Open-source, low cost)
Aurora (Cloud version of MySQL, much more performant)
MariaDB
Microsoft SQL Server (higher cost)
Oracle (higher cost)
If in doubt, go with Aurora since it is more cloud-native and is fully compatible with MySQL. There is now a serverless version of Aurora that can automatically scale and even turn off when unused to lower costs.
As to flexibility of schemas, all SQL databases basically offer the same flexibility via ALTER TABLE.
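For instance, adding new attributes later is a single statement. A minimal sketch, assuming a hypothetical flights table on MySQL/Aurora:

    -- Add columns to an existing table without recreating it
    -- (table and column names are illustrative).
    ALTER TABLE flights
      ADD COLUMN gate VARCHAR(10) NULL,
      ADD COLUMN boarding_time DATETIME NULL;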

Related

Can we use Snowflake as a database for a data-driven web application?

I am an ASP.NET MVC/SQL Server developer and I am very new to all this, so I may be on a completely wrong path.
I learned from googling that Snowflake can put/get data from AWS S3, Google Storage, and Azure, and that Snowflake has its own databases and tables as well.
I have the following questions:
Why should one use Snowflake when you can compute your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
Can we use Snowflake as a database for a data-driven web application? And if yes, could you provide a link or something to get started?
Once again, I am very new to all this and hoping to get ideas on the best way to work around it.
Thank you in advance.
Why should one use Snowflake when you can compute your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform; like other database technologies, it provides data storage, metadata, and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
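As a hedged sketch of those two approaches (the stage and table names below are made up), the Snowflake SQL looks roughly like this:

    -- Query files sitting in cloud storage in place via an external table
    -- (stage and table names are illustrative).
    CREATE OR REPLACE EXTERNAL TABLE flights_ext
      WITH LOCATION = @my_s3_stage/flights/
      FILE_FORMAT = (TYPE = PARQUET);

    -- Or load the files into a native Snowflake table with COPY.
    COPY INTO flights
      FROM @my_s3_stage/flights/
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);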
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports SQL queries; as mentioned above, there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database, though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as a database for a data-driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example, Snowflake doesn't enforce referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said. To add to that, in my experience, every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem with MPP/distributed databases is that they don't enforce referential integrity, so if that's important to you, you don't want to use them.
Snowflake and other MPP/distributed databases are NOT meant for OLTP workloads but for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs the more money they make. To make money they have to optimize performance, but not too much, otherwise it would affect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.
Enforcing referential integrity is a double-edged sword; the downside is that as the data volume grows, the referential-violation check significantly slows down inserts and deletes. This results in the developer having to put the RI check in the program (with a dirty read) and turn off RI enforcement in the database, finally ending up in a Snowflake-like situation.
The bottom line is that Snowflake not enforcing RI should not be a limitation for OLTP applications.
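To make the referential-integrity point concrete: Snowflake accepts constraint definitions like the ones below (tables are illustrative), but only NOT NULL is actually enforced; primary and foreign keys are stored as metadata and violations are not rejected, so any RI check has to live in the loading process or the application.

    -- Snowflake accepts this DDL, but the PRIMARY KEY and FOREIGN KEY are
    -- informational only; duplicate or orphaned rows will not be rejected.
    CREATE TABLE customers (
      customer_id NUMBER PRIMARY KEY
    );

    CREATE TABLE orders (
      order_id    NUMBER PRIMARY KEY,
      customer_id NUMBER REFERENCES customers (customer_id),
      amount      NUMBER(12,2) NOT NULL  -- NOT NULL is the one constraint that is enforced
    );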

Move from Azure Table Storage to other database

I have been asked to move an Azure Table Storage service from the Microsoft Dublin Data Centre to a data centre in the UK (any secure data centre so long as it is UK based). My problem is that Azure table storage is only supported in Microsoft data centres (Dublin and Amsterdam in Europe).
There are about 50 tables, plus queues and blobs. The data requirements are for highly scalable storage. I do not want to rewrite the storage layer on SQL Server because this would require schema management and direct management of indexes for performance. I would prefer a performant NoSQL database that operates to the standards of Azure Table Storage.
Does anyone have any experience in this area?
As far as migrating your data, there's no automated way to do it. Check out the AzCopy utility for blobs.
As far as which database to choose, that's really going to be app-dependent. You'll need to think about search, storage, indexing, map/reduce/aggregation, etc. Then there's document, column, graph, key/value, sql - you may choose one or a combination.
Just remember that Table Storage is storage-as-a-service, with triple-replicated storage providing durability, and upwards of 2000 tps per partition, 20K tps per storage account. You'll need to build this out yourself, whatever you choose (maybe Redis?).
Anyway: This is pretty wide-open as far as solving your architecture. But hopefully I gave you something to think about.
One more thing: You should really look into the reasons for moving your data. Many people talk about data sovereignty, but sometimes it turns out that the data location doesn't violate any local data laws (or that some data can actually remain where it is, with only specific data needing to be hosted within a country's boundaries).

Best practices to copy a large number of very big databases with SSIS

I have been tasked with getting a copy of our SQL Server 2005/2008 databases in the field on-line internally and updating them daily. Connectivity with each site is regulated, so on-line access is not an option. Field databases are Workgroup licensed. The main server is Enterprise with some obscene number of processors and RAM. The purpose of the copies is two-fold: (1) on-line backup and (2) source for ETL to the data warehouse.
There are about 300 databases, identical schema for the most part, located throughout the US, Canada and Mexico. Current DB sizes range from 5 GB to over 1 TB. Activity varies, but is about 1,500,000 new rows daily on each server, mostly in 2 tables. There are about 50 tables total in each. Connection quality and bandwidth with each site varies, but the main site has enough bandwidth to do many sites in parallel.
I'm thinking SSIS, but am not sure how to approach this task other than table-by-table. Can anyone offer any guidance?
Honestly, I would recommend using SQL replication. We do this quite a bit, and it will even work over dialup. It basically minimizes traffic needed as only changes are transferred.
There are several topologies. We only use merge (two way), but transactional might be OK for your needs (one way).
Our environments are a single central DB, replicating (using filtered replication articles) to various site databases. The central DB is the publisher. It is robust, once in place, but is a nuisance for schema upgrades.
However, given your databases aren't homogeneous, it might be easier to set it up where the remote site is the publisher, and the central SQL instance has a per site database that is a subscriber to the site publisher. The articles wouldn't even need to be filtered. And then you can process the individual site data centrally.
Note that the site databases would need the replication components installed (they are generally optional in the installer). To be set up as publishers they'd also need local configuration (distribution configured on each one). Being Workgroup edition, it can act as a publisher; SQL Express can't act as a publisher.
It sounds complicated, but it is really just procedural, and an inbuilt mechanism for doing this sort of thing.
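For a flavour of the mechanics, the core of a transactional publication at a remote site comes down to a handful of system stored procedures. This is only a hedged outline (database, publication, table, and server names are made up), and a real setup involves more configuration around the distributor and agents:

    -- Run at the remote site (the publisher); names are illustrative.
    EXEC sp_replicationdboption
         @dbname = N'SiteDB', @optname = N'publish', @value = N'true';

    EXEC sp_addpublication
         @publication = N'SiteDB_Pub', @status = N'active';

    EXEC sp_addarticle
         @publication = N'SiteDB_Pub', @article = N'BigTable',
         @source_owner = N'dbo', @source_object = N'BigTable';

    -- Push changes to the per-site copy on the central server.
    EXEC sp_addsubscription
         @publication = N'SiteDB_Pub', @subscriber = N'CentralSQL',
         @destination_db = N'SiteDB_Copy', @subscription_type = N'push';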

Database architecture (single DB vs client-specific DB) for building an enterprise web (RIA) application on the cloud

We are working on rewriting our existing RIA and re-architecting its database design. We now have 2 options for the database:
(These choices are for SaaS-based hosting.)
1) Individual database for each customer.
2) Single DB for all customers.
We are expecting a good amount of data; some of our customers have DB sizes ranging from 2 GB to 10 GB. The number of tables is around 100.
Can I get advice on which choice we should go for?
We are not considering a NoSQL solution as of now, but we are planning to support about 4-5 databases through JPA (Java Persistence API), including MySQL, Postgres, Oracle, and MSSQL for now.
P.S: We might leverage Amazon cloud for hosting.
The three main techniques usually applied to the database layer for this kind of multi-tenant requirement are listed below. You have already mentioned some of them.
Separate databases for each tenant:
Very high cost, easy to maintain/customize, easy to tune, easy to back up, easy to code to.
Shared database but a different schema per tenant:
Low cost compared to (1), may encounter issues quickly with increased DB size, easy to personalize per tenant, difficult to back up/restore per tenant, easy to code to.
Shared database, shared schema:
Low cost, the load of one tenant will affect others, security and app development are a challenge, difficult to personalize per tenant, difficult to restore/back up.
I think the above points hold good for hosting on premises or in the cloud.
If you see the number of tenants growing or the data getting bigger, then 1) or 2) is better. I have used option 2) and have seen it help development and maintenance.
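To make options 2) and 3) concrete, here is a minimal sketch in Postgres/SQL Server-style SQL (tenant and table names are made up): option 2) gives every tenant its own schema containing identical tables, while option 3) shares one set of tables and tags every row with a tenant identifier that every query must filter on.

    -- Option 2: shared database, separate schema per tenant (names are illustrative).
    CREATE SCHEMA tenant_acme;
    CREATE TABLE tenant_acme.invoices (
      invoice_id INT PRIMARY KEY,
      amount     DECIMAL(12,2)
    );

    -- Option 3: shared database and shared schema; every row carries a tenant_id.
    CREATE TABLE invoices (
      tenant_id  INT NOT NULL,
      invoice_id INT NOT NULL,
      amount     DECIMAL(12,2),
      PRIMARY KEY (tenant_id, invoice_id)
    );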

What is the best database technology for storing OHLC historical prices?

Just for end-of-day data there will be billions of rows. What is the best way to store all that data? Is SQL Server 2008 good enough for that, or should I look towards a NoSQL solution like MongoDB? Any suggestions?
It would be good to have one master DB with read/write permissions and one or more read-only replicas of it. Only the master database would be used for adding new prices to the storage. It would also be useful to be able to replicate OHLC prices for the most popular securities individually, in order to optimize read access.
This data will then be streamed to a trading platform on clients' machines.
You should consider Oracle Berkeley DB, which is in production doing this within the infrastructure of a few well-known stock exchanges. Berkeley DB will allow you to record information at a master as simple key/value pairs; in your case I'd imagine a timestamp for the key and an encoded OHLC set for the value. Berkeley DB supports single-master, multi-replica replication (called "HA", for High Availability), which supports exactly what you've outlined: read scalability. Berkeley DB HA will automatically fail over to a new master if/when necessary. Using some simple compression and other basic features of Berkeley DB you'll be able to meet your scalability and data volume targets (billions of rows, tens of thousands of transactions per second - depending on your hardware, OS, and configuration of BDB - see the 3n+1 benchmark with BDB for help) without issue.
When you start working on accessing that OHLC data, consider Berkeley DB's support for bulk-get, and make sure you use the B-tree access method (because your data has order, and locality will provide much faster access). Also consider the Berkeley DB partitioning API to split your data (perhaps based on symbol, or even based on time). Finally, because you'll be replicating the data, you can relax the durability constraints to DB_TXN_WRITE_NOSYNC as long as your replication acknowledgement policy requires a quorum of replicas to ACK a write before considering it durable. You'll find that a fast network beats a fast disk in this case. Also, to offload some work from your master, enable peer-to-peer log replica distribution.
But, first read the replication manager getting started guide and review the rep quote example - which already implements some of what you're trying to do (handy, eh?).
Just for the record, and in full disclosure: I work as a product manager at Oracle on the Berkeley DB products, and have for the past nine years, so I'm a tad biased. I'd guess that the other solutions - SQL-based or not - might eventually give you a working system, but I'm confident that Berkeley DB can do it without too much effort.
If you're really talking billions of new rows a day (Federal Express' data warehouse isn't that large), then you need an SQL database that can partition across multiple computers, like Oracle or IBM's DB2.
Another alternative would be a heavy-duty system managed storage like IBM's DFSMS.
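Whichever engine is picked, the storage side of this usually ends up as a date-partitioned table keyed on symbol and date, so that old partitions can be switched out or archived cheaply. A rough sketch in SQL Server syntax (the partition function, scheme, boundary dates, and column names are all illustrative):

    -- Partition end-of-day bars by year so older data can be managed per partition
    -- (names and boundary dates are illustrative).
    CREATE PARTITION FUNCTION pf_ohlc_year (date)
        AS RANGE RIGHT FOR VALUES ('2011-01-01', '2012-01-01');

    CREATE PARTITION SCHEME ps_ohlc_year
        AS PARTITION pf_ohlc_year ALL TO ([PRIMARY]);

    CREATE TABLE ohlc_daily (
        symbol      VARCHAR(12)   NOT NULL,
        trade_date  DATE          NOT NULL,
        [open]      DECIMAL(18,6) NOT NULL,
        high        DECIMAL(18,6) NOT NULL,
        low         DECIMAL(18,6) NOT NULL,
        [close]     DECIMAL(18,6) NOT NULL,
        volume      BIGINT        NOT NULL,
        PRIMARY KEY (symbol, trade_date)
    ) ON ps_ohlc_year (trade_date);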
