Can we use Snowflake as the database for a data-driven web application? - snowflake-cloud-data-platform

I am an ASP.NET MVC/SQL Server developer and I am very new to all of this, so I may be on a completely wrong path.
From googling I learned that Snowflake can put/get data from AWS S3, Google Cloud Storage, and Azure, and that Snowflake has its own databases and tables as well.
I have the following questions:
Why should one use Snowflake when you can process your data with cloud storage (S3, etc.) and Talend or any other ETL tool?
Can we use Snowflake as the database for a data-driven web application? If yes, could you provide a link or something to get started?
Once again, I am very new to all of this and am hoping to get ideas on the best way to approach it.
Thank you in advance.

Why should one use Snowflake when you can process your data with cloud storage (S3, etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform. Like other database technologies, it provides data storage, metadata, and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
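For illustration, here is a minimal sketch of both approaches; the stage, table, and file names are hypothetical, and JSON files in S3 are assumed:

```sql
-- Query files in cloud storage in place via an external stage and external table.
-- With no column list, the data is exposed as a single VARIANT column named VALUE.
CREATE STAGE my_s3_stage
    URL = 's3://my-bucket/events/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

CREATE EXTERNAL TABLE events_ext
    LOCATION = @my_s3_stage
    FILE_FORMAT = (TYPE = JSON);

-- Or load the files into a native Snowflake table with COPY:
CREATE TABLE events (payload VARIANT);
COPY INTO events FROM @my_s3_stage FILE_FORMAT = (TYPE = JSON);
```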
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports SQL queries; as mentioned above, there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database, though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as the database for a data-driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example, Snowflake doesn't enforce referential integrity constraints.
However, if your web application is an analytical one (for example, it has embedded reports that query a large amount of data, and users will typically be reading data rather than adding it), then you could use Snowflake as the backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.

You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.

I agree with much of what Nathan said. To add to that: in my experience, every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem with MPP/distributed databases is that they don't enforce referential integrity, so if that's important to you, you don't want to use them.
Snowflake and other MPP/distributed databases are NOT meant for OLTP workloads but for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs, the more money they make. To make money they have to optimize performance, but not too much, otherwise it will affect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.

Enforcing referential integrity is a double-edged sword: the downside is that as data volume grows, the referential violation checks significantly slow down inserts and deletes. This often results in the developer putting the RI check in the application code (with a dirty read) and turning off RI enforcement in the database, finally ending up in a Snowflake-like situation, as sketched below.
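A hedged illustration of that pattern in SQL Server syntax (table, column, and variable names are hypothetical): the application does a dirty read to check the parent row itself, with no FOREIGN KEY defined in the database.

```sql
-- Hypothetical application-side RI check; no FOREIGN KEY exists in the database.
-- The NOLOCK hint makes this a dirty read: fast, but it can miss in-flight changes.
SELECT 1
FROM dbo.Customer WITH (NOLOCK)
WHERE CustomerId = @CustomerId;

-- Only if the row above was found does the application proceed with:
INSERT INTO dbo.Orders (OrderId, CustomerId, OrderTotal)
VALUES (@OrderId, @CustomerId, @OrderTotal);
```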
Bottom line: Snowflake not enforcing RI should not be a limitation for OLTP applications.

Related

Is "Snowflake Data Cloud" a good choice for a cloud-native transactional application data-store?

Currently, I generate data in a different datastore and replicate it to Snowflake staging; that data then moves to the data warehouse DB through ELT ingestion for analytics purposes. However, this approach can be considered as creating data silos in itself, since we already have three copies of the same data:
Transactional data-store DB
Replicated Snowflake staging
Snowflake Data Warehouse DB
From a technical architecture point of view, is it a good idea to use Snowflake as a direct datastore for a transactional application (one that does many CRUD operations)? That may help avoid the cost of replication and ingestion.
The main problem I see with this approach is that Snowflake does not enforce any referential integrity (primary keys, foreign keys), so within the CRUD app I have to either always use a MERGE statement or somehow make sure I don't create duplicate records.
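For context, a minimal sketch of the kind of MERGE-based upsert being described (table and column names are purely illustrative):

```sql
-- Upsert keyed on id, since Snowflake will not reject duplicate keys on its own.
MERGE INTO customers t
USING (SELECT 42 AS id, 'Jane Doe' AS name) s
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (s.id, s.name);
```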
The other problem is that, in the cloud, the distance (i.e. the network) between the app and Snowflake determines the performance of the transactions, and I want good, consistent performance for my CRUD operations.
Any thoughts/suggestions are much appreciated.
Snowflake, as of today, does not perform well with singleton updates and inserts, which is what we mostly see with transactional databases. I have seen performance degradation when singleton inserts are submitted against Snowflake.
On the other hand, it is heavily optimized for bulk ingestion of structured and unstructured data and is designed for OLAP warehouses. You can still use it, but you may see the same performance degradation. Also, primary keys can be defined, but they are not enforced.
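To illustrate that last point, a small sketch with a hypothetical table: Snowflake accepts the primary key declaration but will happily accept duplicate keys.

```sql
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,   -- declared, but not enforced by Snowflake
    amount   NUMBER(10,2)
);

INSERT INTO orders VALUES (1, 10.00);
INSERT INTO orders VALUES (1, 20.00);  -- duplicate key, yet the insert succeeds

SELECT COUNT(*) FROM orders WHERE order_id = 1;  -- returns 2
```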
In my opinion, if you are faced with that challenge, you have the option of using a PostgreSQL database (open source) in the cloud as your transactional database; it acts as a good complement to Snowflake as the OLAP database.
No. Snowflake isn't good as a transactional / OLTP database for the reasons you've mentioned. Plus, it won't perform well with many individual CRUD operations due to how it structures the data (optimised for OLAP workloads).
Just want to point out that there are benefits to keeping separate databases. For one, you want to isolate your transactional database from your analytics database, otherwise you could significantly affect the performance of the application. Secondly, the data in the transactional database could change, and if you had to reprocess the data for whatever reason you may not be able to do so. There are many more reasons, but I will stop here :-)

Loading data from SQL Server to Elasticsearch

Looking for suggestions on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for reporting.
We currently use a 3rd-party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs, and as a result there is a 24-hour data latency.
We are looking to build something that would allow for more real-time availability of the data, similar to SSRS, for our clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data over the weekend, and then stream writes in real time during weekdays.
Thanks.
Elasticsearch's main use case is providing search capabilities on top of large volumes of unstructured, text-based data. For example, if you were ingesting large batches of emails into your data store every day, Elasticsearch is a good tool for parsing out pieces of those emails, based on rules you set up, to enable search (and to some degree querying) capability over those email messages.
If your data is already in SQL Server, it sounds like it's structured already, and therefore there's not much to gain from Elasticsearch in terms of reportability and availability. Rather, you'd likely be introducing extra complexity into your data workflow.
If you have structured data in SQL Server already, and you are experiencing issues with reporting directly off of it, you should look at building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features you could look into to accomplish this are AlwaysOn Availability Groups, Replication, and SSIS.
Each option above (in addition to other out-of-the-box features of SQL Server) has different pros and cons. For example, AlwaysOn Availability Groups are very easy to set up and offer the ability to automatically fail over if your main server has an outage, but they clone the entire database to a replica. Replication lets you more granularly choose to copy only specific Tables and Views, but then you can't fail over as easily if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
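As a rough sketch of that kind of change (the table, index, and column names are hypothetical), a nonclustered columnstore index on the columns your reports aggregate can often be added without restructuring the table:

```sql
-- SQL Server: add a columnstore index to speed up analytical scans and aggregations
-- on an existing rowstore table used for reporting.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Sales_Columnstore
    ON dbo.Sales (SaleDate, ProductId, Quantity, Amount);
```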
I've been down both pathways of implementing Elasticsearch and a data warehouse using all three of the main data synchronization features above, for structured data and unstructured large text data, and have experienced the proper use cases for both. One data warehouse I've managed in the past had Tables with billions of rows (each Table terabytes in size), and it was highly performant for reporting on fairly modest hardware in AWS (we weren't even using Redshift).

GCP Storage for large temporary data

I'm using a Cloud SQL instance to store two types of data: typical transactional data and large "read-only" data. Each of these read-only tables could have GBs of data and they work like snapshots that are refreshed once a day. The old data is totally replaced by the most recent data. The "read-only" tables reference data from the "transactional tables", but I don't necessarily need to perform joins between them, so they're kind of "independent".
In this context, I believe using Cloud SQL to store these kinds of tables is going to be a problem in terms of billing. Because Cloud SQL is fully managed, I would be paying for maintenance work from Google, and I wouldn't need any kind of maintenance for those specific tables.
Maybe there are databases more suitable for storing snapshot/temporary data. I'm considering moving those types of tables to another kind of storage, but it's possible that I would end up making the bill even higher. Or maybe I could continue using Cloud SQL for those tables and just unlog them.
Can anyone help me with this? Is there any kind of storage in GCP that would be a good fit for large snapshots that are refreshed once a day? Or is there a workaround to make Cloud SQL not maintain those tables?
This is a tough question because there are a lot of options and a lot of things that could work. The GCP documentation page "Choosing a Storage Option" is very handy in this kind of case. It has a flowchart to select a storage option based on the kind of data you want to store, a video that explains each storage option, and a table with the description, strong points, and use cases for each option. I would recommend starting there.
Also, if the issue with Cloud SQL is that it is fully managed and pricey, you can set up MySQL on Google Compute Engine and manage it yourself. It is also considerably cheaper for the same machine: for an n1-standard-1, $0.0965 per hour in Cloud SQL versus $0.0475 per hour in GCE (keep in mind that other charges may apply on top of the machine price).

Replicating PostgreSQL data for analytics

I'm currently scoping out a potential development project where we will develop an analytical solution to support a production application. Obviously we want to run queries on reasonably up-to-date data, but we don't want the operational risk of querying the main database directly with (possibly expensive) analytical queries.
To do this I believe we would like to do the following:
Make a replica of a "production" PostgreSQL database into a separate "analytics" database
Add additional tables / views etc to the "analytics" database, which will support the analytics solution only and not be part of the application DB.
Maintain the replica copy of the production data in a reasonably up-to-date fashion (realtime replication not strictly needed, but no more than a few seconds lag would be good)
The database will not be excessively large (it is a web/mobile application with a lot of users but most not likely to be active at any one time).
Is this likely to be feasible with PostgreSQL, and if so what is the best strategy / replication technique to use?
You cannot use streaming replication for that, because you cannot add tables to a read-only database. But you might rethink the requirement to not add the additional tables to the production database.
However, there are other replication techniques like Slony, Bucardo or Londiste.
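For what it's worth, PostgreSQL (version 10 and later) also ships with built-in logical replication, which replicates selected tables while leaving the subscriber writable, so extra analytics tables and views can live alongside the replicated ones. A minimal sketch, with hypothetical table names and connection details:

```sql
-- On the production database (publisher):
CREATE PUBLICATION analytics_pub FOR TABLE users, orders, order_items;

-- On the analytics database (subscriber), which stays writable for extra tables/views:
CREATE SUBSCRIPTION analytics_sub
    CONNECTION 'host=prod-db dbname=appdb user=replicator password=secret'
    PUBLICATION analytics_pub;
```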
One thing that you should keep in mind is that a data model that is suitable for an online transaction processing database is usually not well suited for analytical applications, and you might end up being pretty unhappy with the performance of your analytical queries. For these, the normal thing to do is to build some sort of data warehouse where data are stored in a more denormalized form, usually in something like a star schema.
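For example, a toy star schema might look like the following (tables and columns are illustrative only):

```sql
-- Denormalized dimensions plus a central fact table.
CREATE TABLE dim_date     (date_key INT PRIMARY KEY, full_date DATE, month INT, year INT);
CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name TEXT, region TEXT);

CREATE TABLE fact_orders (
    date_key     INT REFERENCES dim_date (date_key),
    customer_key INT REFERENCES dim_customer (customer_key),
    order_count  INT,
    order_total  NUMERIC(12,2)
);

-- Analytical queries aggregate the fact table and join to the small dimensions:
SELECT d.year, d.month, SUM(f.order_total) AS revenue
FROM fact_orders f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.year, d.month;
```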
But for that you cannot have “no more than a few seconds lag”. Double check if that is really essential, it usually isn't for analytical queries.

Advantages of Hadoop in combination to any database

There are so many different databases.
relational databases
nosql databases
key/value
document store
wide column store
graph databases
And database technologies
in-memory
column-oriented
All have their advantages and disadvantages.
For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.
I am thinking about Hadoop, which has many ways to store data in HDFS or access different databases for analytics.
Is it right to say that Hadoop can make it easier to choose the right database, because it can be used at first as data storage? So if I have Hadoop HDFS as my main data storage, can I still change the database for my application afterwards, or use multiple databases?
First and foremost, Hadoop is not a database. It is a distributed filesystem.
For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.
The choice of database for a project depends on these factors:
Nature of the data storage and retrieval
If it is meant for transactions, it is highly recommended that you stick to an ACID database.
If it is to be used for web applications or random access, then you have a wide variety of choices, from the traditional SQL databases to the latest database technologies that support HDFS as a storage layer, like HBase. Traditional databases are well suited for random access as they fully support constraints and indexes.
If analytical batch processing is the concern, the choice can be made among all the available options based on the structural complexity and volume of the data.
Data Format or Structure
Most SQL databases support structured data (data that can be formatted into tables); some extend their support beyond that, for example for storing JSON and the like.
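As a small illustration (PostgreSQL syntax, with a hypothetical table and fields), a jsonb column can sit alongside ordinary relational columns and still be queried with SQL:

```sql
CREATE TABLE events (
    id      BIGSERIAL PRIMARY KEY,
    payload JSONB                       -- semi-structured data next to relational columns
);

INSERT INTO events (payload)
VALUES ('{"user": "alice", "action": "login"}');

SELECT payload->>'user' AS user_name
FROM events
WHERE payload->>'action' = 'login';
```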
If the data is unstructured, especially flat files, storing and processing it can easily be done with big-data technologies like Hadoop, Spark, or Storm. Again, these technologies only come into the picture if the volume is high.
Different database technologies play well for different data formats. For example, Graph databases are well suited for storing structures representing relationships or graphs.
Size
This is the next big concern: the more data there is, the greater the need for scalability. So it is better to choose a technology that supports a scale-out architecture (Hadoop, NoSQL) rather than one that can only scale up; otherwise this could become a bottleneck in the future when you plan to store more.
I am thinking about Hadoop, which has many ways to store data in HDFS or access different databases for analytics.
Yes, you can use HDFS as your storage layer and use any of the HDFS-supported databases to do the processing (the choice of processing framework, from batch to near-real-time to real-time, is another concern). Note that relational databases do not support HDFS storage. Among NoSQL databases, HBase stores its data directly on HDFS; others, such as MongoDB, integrate with Hadoop through connectors rather than using HDFS as their storage layer.
If I have Hadoop HDFS as my main data storage, can I still change the database for my application afterwards, or use multiple databases?
This could be tricky depending upon which database you want to pair with afterwards.
HDFS is not a POSIX-compatible filesystem, so you can't just use it as general-purpose storage and then deploy any DB on top of it. The database you deploy should have explicit support for HDFS. There are a few options: HBase, Hive, Impala, Solr.
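As a sketch of what that explicit support looks like, a Hive external table can be defined directly over files sitting in HDFS (the path and columns here are made up):

```sql
-- HiveQL: the table is just metadata over delimited files already stored in HDFS.
CREATE EXTERNAL TABLE web_logs (
    ts      STRING,
    user_id STRING,
    url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/web_logs';
```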
