GCP Storage for large temporary data

GCP Storage for large temporary data - database

I'm using a Cloud SQL instance to store two types of data: typical transactional data and large "read-only" data. Each of these read-only tables could have GBs of data and they work like snapshots that are refreshed once a day. The old data is totally replaced by the most recent data. The "read-only" tables reference data from the "transactional tables", but I don't necessarily need to perform joins between them, so they're kind of "independent".
In this context, I believe using Cloud SQL to store this kind of table is going to be a problem in terms of billing. Because Cloud SQL is fully managed, I would be paying for maintenance work from Google, and I don't need any kind of maintenance for those specific tables.
Maybe there are databases more suitable for storing snapshot/temporary data. I'm considering moving those tables to another kind of storage, but it's possible that I would end up making the bill even higher. Or maybe I could continue using Cloud SQL for those tables and just make them unlogged.
Can anyone help me with this? Is there any kind of storage in GCP that would be great for storing large snapshots that are refreshed once a day? Or is there a workaround to make Cloud SQL not maintain those tables?

This is a tough question because there are a lot of options and a lot of things that could work. The GCP documentation page "Choosing a Storage Option" is very handy in this kind of case. It has a flowchart for selecting a storage option based on the kind of data you want to store, a video that explains each storage option, and a table with the description, strong points and use cases for each option. I would recommend starting there.
Also, if the issue with Cloud SQL is that it is fully managed and pricey, you can set up MySQL on Google Compute Engine and manage it yourself. It is also considerably cheaper for the same machine: for an n1-standard-1, $0.0965 per hour in Cloud SQL versus $0.0475 per hour in GCE (keep in mind that other charges may apply on top of the machine price).
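To put the two prices quoted above side by side, here is a rough monthly comparison. It is only a sketch: it assumes the quoted figures are hourly rates, that the instance runs around the clock, and that a month has about 730 hours. Storage, network and backup charges are left out, and the function name is made up for illustration.

```python
# Prices quoted in the answer for an n1-standard-1 (assumed to be per hour).
CLOUD_SQL_HOURLY = 0.0965
GCE_HOURLY = 0.0475

# Assumption: roughly 730 hours in an average month, instance always on.
HOURS_PER_MONTH = 730


def monthly_machine_cost(hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Machine-only cost for one month; storage and network are excluded."""
    return hourly_rate * hours


cloud_sql_monthly = monthly_machine_cost(CLOUD_SQL_HOURLY)
gce_monthly = monthly_machine_cost(GCE_HOURLY)
difference = cloud_sql_monthly - gce_monthly

print(f"Cloud SQL:  ${cloud_sql_monthly:.2f} per month")
print(f"GCE:        ${gce_monthly:.2f} per month")
print(f"Difference: ${difference:.2f} per month")
```

The difference is the price of the managed service; whether it is worth it depends on how much time self-managing MySQL would cost you.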

Related

Can we use snowflake as database for Data driven web application?

I am an ASP.NET MVC/SQL Server developer and I am very new to all of this, so I may be on a completely wrong path.
I learned by googling that Snowflake can put/get data from AWS S3, Google Storage and Azure, and that Snowflake has its own databases and tables as well.
I have the following questions:
Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
Can we use Snowflake as database for data driven web application? and if yes, could you provide link or something to start?
Once again, I am very new to all of this and am hoping to get ideas on the best way to approach it.
Thank you in advance.

Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform. Like other database technologies, it provides data storage, metadata and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports SQL queries; as mentioned above, there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database, though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
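As an illustration of the "rolling your own" option in the last point, a hand-written pipeline is often just three small steps: extract, transform, load. The sketch below is not tied to Snowflake or any ETL product. It uses an in-memory SQLite database as a stand-in for the analytical store, and the CSV content, table name and function names are all invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical export from a line-of-business system.
SOURCE_CSV = "order_id,amount,currency\n1,10.50,usd\n2,3.25,usd\n3,,usd\n"


def extract(text: str) -> list:
    """Read the raw export into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list) -> list:
    """Drop rows without an amount and normalise types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # incomplete record, skipped in this sketch
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["currency"].upper())
        )
    return cleaned


def load(conn: sqlite3.Connection, rows: list) -> int:
    """Write the cleaned rows and return the resulting row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        " order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = load(conn, transform(extract(SOURCE_CSV)))
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(loaded, total)
```

The code itself is cheap to write. The management overhead mentioned above comes from everything around it: scheduling, retries, schema changes and monitoring.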
Can we use Snowflake as database for data driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example, Snowflake doesn't enforce features like referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.

You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.

I agree with much of what Nathan said. To add to that: in my experience, every time I've created a database for an application, it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem of using MPP/Distributed Databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use MPP/Distributed Databases.
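To make "enforcing referential integrity" concrete, here is a small sketch that uses SQLite as a stand-in for an OLTP database (it is not Snowflake, and the table names are invented). With enforcement on, an order pointing at a non-existent customer is rejected; with enforcement off, the orphan row is silently accepted, which is roughly the situation described above.

```python
import sqlite3


def insert_orphan_order(enforce_fk: bool) -> str:
    """Try to insert an order whose customer does not exist."""
    conn = sqlite3.connect(":memory:")
    # SQLite only checks foreign keys when this pragma is switched on.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce_fk else 'OFF'}")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER REFERENCES customers(id))"
    )
    try:
        # Customer 999 was never created.
        conn.execute("INSERT INTO orders (id, customer_id) VALUES (1, 999)")
        conn.commit()
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"
    finally:
        conn.close()


print("enforced:    ", insert_orphan_order(enforce_fk=True))
print("not enforced:", insert_orphan_order(enforce_fk=False))
```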
Snowflake and other MPP/Distributed Databases are NOT meant for OLTP workloads but for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/Distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs, the more money they make. To make money they have to optimize performance, but not too much, otherwise it will affect their income.
This can be an in-depth topic, so I would recommend doing more research into OLTP vs. OLAP.

Enforcing referential integrity is a double-edged sword: the downside is that, as the data volume grows, the referential violation check significantly slows down inserts and deletes. This results in the developer having to put the RI check in the program (with a dirty read) and turn off RI enforcement in the database, finally ending up with a Snowflake-like situation.
Bottom line is Snowflake not enforcing RI should not be a limitation for OLTP applications.
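A minimal sketch of what "putting the RI check in the program" can look like, again with SQLite standing in for the real database and with hypothetical table and function names. The tables carry no foreign key constraint at all; the application looks up the parent row before inserting the child.

```python
import sqlite3


def add_order(conn: sqlite3.Connection, order_id: int, customer_id: int) -> bool:
    """Insert an order only if the referenced customer exists (check done in code)."""
    parent = conn.execute(
        "SELECT 1 FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if parent is None:
        return False  # would create an orphan row, so refuse it
    conn.execute(
        "INSERT INTO orders (id, customer_id) VALUES (?, ?)",
        (order_id, customer_id),
    )
    return True


conn = sqlite3.connect(":memory:")
# No REFERENCES clause: the database itself enforces nothing here.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("INSERT INTO customers (id) VALUES (1)")

ok = add_order(conn, 100, 1)        # customer exists
orphan = add_order(conn, 101, 999)  # customer does not exist
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(ok, orphan, order_count)
```

Note the trade-off: the lookup and the insert are two separate statements, so a concurrent delete of the customer between them can still leave an orphan. That is the price of moving the check out of the database.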

Need a solution to get rid of multiple database

In my company we have a multiple-database structure hosted in SQL Server.
For example, whenever a new customer signs up with us, we create a new DB in SQL Server to maintain their data.
Right now we already have 2000+ DBs on our database server. We expect more customers to sign up in the near future, so the count might even cross 5000.
Having 5000+ DBs, with the count still increasing, might not be advisable. We sometimes run tasks across all the DBs, and if we run tasks across 5000+ DBs we will surely end up with performance issues.
What would be an alternative solution that avoids creating a separate DB for each and every customer while still keeping their data separate?
I keep hearing about Big Data and other database solutions but could not get a clear picture.
Can someone shed some light on this?

If the databases have an identical schema you could combine them into one. That way each customer's table will now become a set of rows in the new database. A new customer will probably be a few new rows in the tables that store customer's profile.
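A rough sketch of the combined layout, using SQLite in place of SQL Server and invented table and column names: every row carries a `customer_id`, per-customer reads always filter on it, and a task that used to loop over thousands of databases becomes a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One shared table; customer_id says which customer owns each row.
conn.execute(
    "CREATE TABLE invoices ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL,"
    " amount REAL NOT NULL)"
)
# Index on the tenant key so per-customer queries stay fast.
conn.execute("CREATE INDEX idx_invoices_customer ON invoices (customer_id)")
conn.executemany(
    "INSERT INTO invoices (customer_id, amount) VALUES (?, ?)",
    [(1, 10.0), (1, 25.0), (2, 99.0)],
)


def invoices_for(conn: sqlite3.Connection, customer_id: int) -> list:
    """Per-customer read: always filtered by the tenant key."""
    return conn.execute(
        "SELECT id, amount FROM invoices WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    ).fetchall()


def totals_per_customer(conn: sqlite3.Connection) -> dict:
    """Cross-customer task done in one query instead of one per database."""
    rows = conn.execute(
        "SELECT customer_id, SUM(amount) FROM invoices GROUP BY customer_id"
    ).fetchall()
    return dict(rows)


print(invoices_for(conn, 1))
print(totals_per_customer(conn))
```

In this sketch the filtering lives in application code, so a forgotten `WHERE` clause would leak data between customers. Pushing that filter into the database is the gap that row-level security is meant to close.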
You can use row-level security to restrict access to each customer's data:
https://msdn.microsoft.com/en-us/library/dn765131.aspx
For pros and cons of using this approach over your existing see: Pros/Cons Using multiple databases vs using single database and Single or multiple databases
Other options provide a great learning opportunity but may have a significant transition cost, even if some of them were indeed better.

One solution I would suggest is to use a prefix on the table names for each customer. You can then solve the security issue by limiting each customer's access to their own set of tables.
The con is that you will have to rewrite your application to add the prefix to each table name whenever it accesses one. If you have a lot of tables, that will be a problem.
I think this is how some multi-site WordPress hosts handle this database issue.
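The rewrite cost mentioned above can be kept small by routing every table reference through one helper. The sketch below uses SQLite and made-up names (`table_name`, `acme`, `globex`). Because table names cannot be passed as query parameters, the helper validates the prefix strictly before it is spliced into SQL.

```python
import re
import sqlite3

# Only lowercase letters and digits, starting with a letter.
_SAFE_IDENTIFIER = re.compile(r"[a-z][a-z0-9]*")


def table_name(customer_prefix: str, base: str) -> str:
    """Build '<prefix>_<base>' and reject anything that is not a plain identifier."""
    for part in (customer_prefix, base):
        if not _SAFE_IDENTIFIER.fullmatch(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return f"{customer_prefix}_{base}"


def create_customer_tables(conn: sqlite3.Connection, customer_prefix: str) -> None:
    """Create the per-customer copy of each table (just one table in this sketch)."""
    orders = table_name(customer_prefix, "orders")
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {orders} (id INTEGER PRIMARY KEY, total REAL)"
    )


conn = sqlite3.connect(":memory:")
create_customer_tables(conn, "acme")
create_customer_tables(conn, "globex")

acme_orders = table_name("acme", "orders")
conn.execute(f"INSERT INTO {acme_orders} (total) VALUES (?)", (12.5,))

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```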

You should consider whether you just store the data and access it with simple queries, or whether you usually run complex queries. If you just store the data and access it with simple queries, and your needs are not 100% relational, maybe you should consider moving part of your data to the HDFS file system:
https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS .
There are many tools for processing data in Hadoop, but the rising one for sure is Spark:
https://en.wikipedia.org/wiki/Apache_Spark
Probably the best solution is to start by moving your historic data into HDFS just for storage and keep the rest as it is until you gain confidence with the Hadoop and Spark paradigm.
Hadoop is a distributed, fault-tolerant file system and Spark is an engine for batch processing huge amounts of unstructured or structured data. Keep in mind that data in Hadoop is usually not structured, so you have to change the way you process your data. If you still want to use SQL, I suggest checking out Impala and Hive as well:
http://impala.io/
https://hive.apache.org/
Take a look at the Cloudera website for a more structured, integrated solution instead of a lot of single tools that you would need to organize yourself:
http://www.cloudera.com/content/www/en-us/solutions.html
They have a QuickStart VM for trying all the Hadoop ecosystem tools; that's probably the best way to start experimenting:
http://www.cloudera.com/content/www/en-us/downloads/quickstart_vms/5-4.html

Move from Azure Table Storage to other database

I have been asked to move an Azure Table Storage service from the Microsoft Dublin Data Centre to a data centre in the UK (any secure data centre so long as it is UK based). My problem is that Azure table storage is only supported in Microsoft data centres (Dublin and Amsterdam in Europe).
There are about 50 tables plus queues and blobs. The data requirements are for highly scalable storage. I do not want to rewrite the storage into SQL Server because this will require schema management and direct management of the indexes for performance. I would prefer a performant NoSQL database that operates to the standards of Azure Table Storage.
Does anyone have any experience in this area?

As far as migrating your data, there's no automated way to do it. Check out the AzCopy utility for blobs.
As far as which database to choose, that's really going to be app-dependent. You'll need to think about search, storage, indexing, map/reduce/aggregation, etc. Then there's document, column, graph, key/value, sql - you may choose one or a combination.
Just remember that Table Storage is storage-as-a-service, with triple-replicated storage providing durability, and upwards of 2000 tps per partition, 20K tps per storage account. You'll need to build this out yourself, whatever you choose (maybe Redis?).
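The two throughput figures above can be turned into a quick sizing exercise when you evaluate a replacement. This is only back-of-the-envelope arithmetic using the numbers quoted in this answer (2000 tps per partition, 20K tps per storage account). The function name and the example target are made up, and real workloads rarely spread evenly across partitions.

```python
import math

# Figures quoted in the answer above.
PARTITION_TPS = 2_000
ACCOUNT_TPS = 20_000


def capacity_plan(target_tps: int) -> dict:
    """Minimum partitions and storage accounts for a target, assuming an even spread."""
    return {
        "partitions": max(1, math.ceil(target_tps / PARTITION_TPS)),
        "accounts": max(1, math.ceil(target_tps / ACCOUNT_TPS)),
    }


# Hypothetical target: 50,000 transactions per second at peak.
plan = capacity_plan(50_000)
print(plan)
```

Whatever you pick instead of Table Storage would have to sustain a comparable per-partition and aggregate rate for the migration to be transparent to the application.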
Anyway: This is pretty wide-open as far as solving your architecture. But hopefully I gave you something to think about.
One more thing: You should really look into the reasons for moving your data. Many people talk about data sovereignty, but sometimes it turns out that the data location doesn't violate any local data laws (or that some data can actually remain where it is, with only specific data needing to be hosted within a country's boundaries).

Cloud/hosted database/datastore services to replace local SQL Server instance

As a .NET web developer, I've always used SQL Server as my database store because it's already in the MSFT ecosystem and easy to work with from the .NET platform.
Recently, however, I had a computer almost literally blow up, and consequently lost all my data in SQL Server on that machine.
Now that I've got a new computer, I want to start using an off-site database so that this doesn't happen again: a database hosted by a third party (i.e. a hosting company) or a cloud service.
It doesn't have to be SQL Server or even an RDBMS necessarily, but if it's not, it'd be something cutting-edge (e.g. redis, Cassandra, MongoDB, CouchDB, etc.) and not just MySQL or Postgres or something.
Does anyone have any recommendations for those with little financial means?
I'd like to be able to use it during development of projects, and if they ever go live, not have to migrate the data anywhere to a new service--keep the data right there where it is and point my live domain requiring the data to the same service it pointed while in development.

It's not so much a question of available hosted services as of what setup you want for your standard development environment. If one of the cloud datastores doesn't work for you, you can always get a virtual server and install whatever you need.
However, you may want to rethink the idea of putting dev databases in the cloud. Performance will not be as good as something running locally (particularly if you are working with things like bulk import), and turning a dev database into a production database isn't a particularly good idea. I think what you are really looking for is a combination of easy backup, schema management and data setup.
Backup on a live server is easy enough - either you are backing up the entire server or have a script that uploads the backup file somewhere. For dev I don't bother as I prefer to set up disposable environments - have code that can set up the database if it doesn't already exist and add any necessary default data. Most apps don't need much data unless there is some sort of import process involved, and the same code works quite nicely when you first set up the live environment.
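A minimal sketch of the disposable-environment idea described above: one function that creates the schema if it is missing and adds default data, written so it can be run any number of times. SQLite and the `roles` table are stand-ins chosen for the example, not part of the original answer.

```python
import sqlite3

# Hypothetical default data the application needs in order to start.
DEFAULT_ROLES = [("admin",), ("editor",), ("viewer",)]


def ensure_database(conn: sqlite3.Connection) -> None:
    """Create the schema if missing and add default rows; safe to call repeatedly."""
    conn.execute("CREATE TABLE IF NOT EXISTS roles (name TEXT PRIMARY KEY)")
    # INSERT OR IGNORE skips rows that already exist instead of failing.
    conn.executemany("INSERT OR IGNORE INTO roles (name) VALUES (?)", DEFAULT_ROLES)
    conn.commit()


conn = sqlite3.connect(":memory:")
ensure_database(conn)  # first run: creates everything
ensure_database(conn)  # second run: changes nothing
role_count = conn.execute("SELECT COUNT(*) FROM roles").fetchone()[0]
print(role_count)
```

Because the function is idempotent, the same code can bootstrap a fresh dev machine and the first live deployment, which is why losing a dev database stops being a disaster.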
Schema management is one of the more painful aspects of working with SQL and where NoSQL systems can make life a lot easier as most have the schema defined entirely by the code that is using it - I mostly use redis myself, but whether or not it is appropriate for you will depend on the type of project you work on - if you need a lot of joins or transactions you probably need SQL, but if you just need basic data storage most NoSQL platforms would be better.

May I suggest looking into Windows Azure table storage? It is quite different from the pure relational play of SQL Server, is the "next big thing" from Microsoft, and is in general somewhat of a paradigm shift for folks used to relational databases.
If you're ever going to come face to face with Azure in the future (and I suspect many .NET people will), it may be a beneficial experience to have.
With respect to costs, they're negligible for individual use. 10,000 transactions a month cost a penny. A gigabyte per month of storage costs 15 cents, and data transfers are 10-15 cents per gigabyte.
If you have only "development" projects that store their data in the cloud, I'll be damned if you pay more than $2-3/month to MS... if that :)
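For a sanity check of that estimate, here is the arithmetic using only the prices quoted in this answer (they are the figures from the post and may not match current pricing). The usage numbers are invented for a small development project, and the transfer price uses the upper end of the quoted range.

```python
# Prices as quoted in the answer above (US dollars).
PRICE_PER_10K_TRANSACTIONS = 0.01
PRICE_PER_GB_STORED = 0.15
PRICE_PER_GB_TRANSFERRED = 0.15  # upper end of the quoted 10-15 cent range


def monthly_cost(transactions: int, stored_gb: float, transferred_gb: float) -> float:
    """Estimated monthly bill for the given usage, in dollars."""
    return (
        transactions / 10_000 * PRICE_PER_10K_TRANSACTIONS
        + stored_gb * PRICE_PER_GB_STORED
        + transferred_gb * PRICE_PER_GB_TRANSFERRED
    )


# Hypothetical dev workload: 500k transactions, 5 GB stored, 5 GB transferred.
estimate = monthly_cost(transactions=500_000, stored_gb=5, transferred_gb=5)
print(f"${estimate:.2f} per month")
```

Even with half a million transactions a month, the total stays at the low end of the $2-3 range mentioned above.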

Google Cloud Datastore is in beta now and could be a good option for you. It's free up to 1GB and 50K requests per day. The API is rather low level. However, I wrote a high level ORM for GCD called Pogo that serializes and deserializes plain old objects into GCD entities.
Take a look at the documentation and open source here - http://code.thecodeprose.com/pogo
It's also available on NuGet as "Pogo".

Suggestions for a hosted database

I would like to have a SQL database online, but don't want to deal with its care and feeding. There are some commercial offerings out there for hosted DBs, for example Amazon SimpleDB. Can anybody suggest others, and if they used any of these services what their impressions were? Anything that helps me make an informed decision would be appreciated.
Edit: Since there's no one true answer, I've made this a community wiki.

Did you take a look at the Amazon Relational Database Service? It is a MySQL instance, and it is priced in a similar fashion to the EC2 products.

Google's AppEngine also has a SQL Database: http://code.google.com/appengine that is free, but it doesn't scale very well.
Amazon's SimpleDB is lacking a large chunk of the MySQL API, so if you want to go this route, try to stick to SQL92 as much as possible. Also, keep in mind that you are charged per query. This means you want to make every query count. One way of doing that is by using relative updates:
UPDATE persondata SET age=age+1;
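To show why this matters when each query is billed, the sketch below counts statements for the two approaches. SQLite is used purely as a stand-in so the example runs anywhere; the rows and function names are invented, and the count is a simple tally rather than anything a real billing system reports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persondata (name TEXT PRIMARY KEY, age INTEGER)")
conn.executemany(
    "INSERT INTO persondata VALUES (?, ?)",
    [("ann", 30), ("bob", 41), ("cy", 25)],
)


def bump_ages_row_by_row(conn: sqlite3.Connection) -> int:
    """Read every row, then write each new value back: 1 + N statements."""
    statements = 1
    rows = conn.execute("SELECT name, age FROM persondata").fetchall()
    for name, age in rows:
        conn.execute(
            "UPDATE persondata SET age = ? WHERE name = ?", (age + 1, name)
        )
        statements += 1
    return statements


def bump_ages_relative(conn: sqlite3.Connection) -> int:
    """Let the database compute the new value: a single statement."""
    conn.execute("UPDATE persondata SET age = age + 1")
    return 1


row_by_row_statements = bump_ages_row_by_row(conn)
relative_statements = bump_ages_relative(conn)
ages = conn.execute("SELECT name, age FROM persondata ORDER BY name").fetchall()
print(row_by_row_statements, relative_statements, ages)
```

The relative form also avoids the race where another writer changes a row between the read and the write.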
To be honest, SimpleDB is a waste of money unless you need a large SQL cluster. I'd start off with a local SQL DB; when your load starts to get out of hand, move the SQL DB to its own server. After that, you will be looking at clustering your SQL DB, and then SimpleDB starts to become an attractive solution.