Partition-level access on data stored in Snowflake - snowflake-cloud-data-platform

I am new to Snowflake and was exploring Snowflake on AWS. When data is stored in Snowflake, I understood that we can create and manage data in partitions similar to what we do in Hive. Hive doesn't allow me to have partition-level user access management. Can I do that with Snowflake? If yes, how do we do it, and how is it managed on the storage layer on AWS?

With Snowflake you have no direct access to the underlying storage; you can only use the access mechanisms that Snowflake provides. Snowflake handles the provisioning, management and layout of your data on the underlying storage entirely transparently, so you can't "create and manage data in partitions similar to what we do in Hive".
If you want to understand more about how this storage works, you can read about micro-partitioning here.
In the vast majority of cases there is no need to interfere with how Snowflake lays out your data, but there is functionality available to force how the data is clustered - though Snowflake suggests that this is only ever useful on multi-terabyte tables. You can read about clustering tables here.
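For illustration only, defining a clustering key on an existing table looks roughly like this (the table and column names here are invented, not from the question):

    -- Cluster a hypothetical large fact table by date and region
    ALTER TABLE sales_fact CLUSTER BY (sale_date, region);

    -- Inspect how well the table is currently clustered on those columns
    SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date, region)');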
Snowflake does have the concept of "external tables" - these are tables that appear in Snowflake databases as normal tables, but their data is actually held in S3 (or Azure Blob or GCP storage) that you own and manage rather than Snowflake. These tables can be convenient to create and use but perform significantly worse than tables held directly in Snowflake: when data is loaded into Snowflake it may still ultimately be stored on S3, but it is compressed, converted into columnar format and held in micro-partitions, so it is very different in structure from the files you can see in your S3 buckets.
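As a rough sketch (the stage, column and table names below are invented), an external table over Parquet files in an S3 bucket you own might be declared like this, with a partition column derived from the file path:

    CREATE EXTERNAL TABLE events_ext (
      event_date DATE AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 3), 'YYYY-MM-DD'),
      payload VARIANT AS (VALUE)
    )
    PARTITION BY (event_date)
    LOCATION = @my_s3_stage/events/
    FILE_FORMAT = (TYPE = PARQUET);

Queries filtering on event_date can then skip folders that don't match, but the data itself stays in your bucket in whatever format you wrote it.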

Related

Azure Synapse: design questions on external tables vs. internal tables

I'm designing a data warehouse in Azure Synapse using a SQL pool, but I'm facing some design questions.
Context: my plan is to load partitioned Parquet files into Azure Data Lake Storage (ADLS) and then, with the SQL pool, create external tables to query those files.
My questions are:
Is it better in terms of performance to build the solution with external tables only? That is, without creating internal tables via CTAS, BCP, or COPY from ADLS into storage in the database.
Is it possible to perform partitioning in external tables? Is it enough to organize the Parquet files in folders named by date?
How does user concurrency affect external tables and internal tables? Any recommendations from experience?
Thanks for your time.
Josh
Is it better in terms of performance to build the solution with external tables only?
No. Internal Tables are distributed columnstores, with multiple levels of caching, and typically out-perform external parquet tables. Internal tables additionally support batch-mode scanning, columnstore ordering, segment elimination, partition elimination, materialized views, and resultset caching.
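To illustrate the internal-table option, a CTAS from an external table into a distributed columnstore might look something like this (the schema, table and distribution column names are hypothetical):

    CREATE TABLE dbo.FactSales
    WITH (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT *
    FROM ext.FactSales_Parquet;  -- external table over the ADLS Parquet files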
Is it possible to perform partitioning in external tables?
This is not currently possible in Dedicated SQL Pools; see Folder Partition Elimination.
How does user concurrency affect external tables and internal tables?
Concurrency is a matter of query performance. The faster your queries perform, the faster sessions give up their concurrency slot. So anything that improves query performance improves the effective concurrency (the number of concurrent users you can support with reasonable query runtime).
Serverless SQL Pools currently have more advanced capabilities for working with data stored as Parquet or Delta in the Data Lake.

Is "Snowflake Data Cloud" a good choice for a cloud-native transactional application data-store?

Currently, I generate data on a different datastore and replicate it to Snowflake staging; that data then moves to the data warehouse DB through ELT ingestion for analytics purposes. However, this approach can be considered as creating data silos in itself, since we already have three copies of the same data:
Transactional data-store DB
Replicated snowflake staging
Snowflake Data Warehouse DB
From a technical architecture point of view, is it a good idea to use Snowflake as a direct datastore for a transactional application (an application that does many CRUD operations)? That may help avoid the cost of replication and ingestion.
The main problem I see with this approach is that Snowflake does not enforce referential integrity (primary keys, foreign keys), so within the CRUD app I either have to always use a MERGE statement or somehow make sure I don't create duplicate records.
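For example, each upsert would end up looking something like this (the table and column names are just placeholders):

    MERGE INTO customers t
    USING (SELECT 42 AS id, 'Jane' AS name) s
      ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.name = s.name
    WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);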
The other problem is that, being in the cloud, the distance (i.e. the network) between the app and Snowflake determines the performance of the transactions; I want good, consistent performance for my CRUD operations.
Any thoughts/suggestions are much appreciated.
Snowflake as of today does not perform well with singleton updates and inserts, which is what we mostly see with transactional databases. I have seen performance degradation when singleton inserts are submitted against Snowflake.
On the other hand, it is very well optimized for bulk ingestion of structured and unstructured data and is designed for OLAP warehouses. You can still use it, but you may see the same performance degradation. Also, primary keys can be defined, but they are not enforced.
In my opinion, if you are faced with that challenge, you have the option to use a PostgreSQL DB (open source) in the cloud as your transactional database; it acts as a good complement to Snowflake as the OLAP database.
No. Snowflake isn't good as a transactional / OLTP database for the reasons you've mentioned. Plus, it won't perform well with many individual CRUD operations due to how they structure the data (optimised for OLAP workloads).
Just want to point out that there are benefits to keeping separate databases: for one, you want to isolate your transactional database from your analytics database, otherwise you could significantly affect the performance of the application. Secondly, the data in the transactional database could change, and if you had to reprocess the data for whatever reason you may not be able to do so. There are many more, but I will stop here :-)

Can we use Snowflake as the database for a data-driven web application?

I am an ASP.NET MVC/SQL Server developer and I am very new to all of this, so I may be on a completely wrong path.
I came to know by googling that Snowflake can put/get data from AWS S3, Google Storage and Azure, and that Snowflake has its own databases and tables as well.
I have the following questions:
Why should one use Snowflake when you can process your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
Can we use Snowflake as the database for a data-driven web application? If yes, could you provide a link or something to get started?
Once again, I am very new to all this and am hoping to get ideas and the best way to work around this.
Thank you in advance.
Why should one use Snowflake when you can process your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform. Like other database technologies, it provides data storage, metadata and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself (a short sketch follows these three points).
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
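To make the Snowflake point above concrete, here is a minimal sketch of loading staged Parquet files into a native table (the stage, file format and table names are invented):

    -- Hypothetical named file format and external stage
    CREATE OR REPLACE FILE FORMAT parquet_ff TYPE = PARQUET;

    COPY INTO analytics.orders
    FROM @my_s3_stage/orders/
    FILE_FORMAT = (FORMAT_NAME = 'parquet_ff')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;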
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports the use of SQL queries, as mentioned above there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as the database for a data-driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example, Snowflake doesn't enforce referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said. To add to that, in my experience, every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem of using MPP/Distributed Databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use MPP/Distributed Databases.
Snowflake and other MPP/distributed databases are NOT meant for OLTP workloads but rather for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs, the more money they make. For them to make money they have to optimize performance, but not too much, otherwise it will affect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.
Enforcing referential integrity is a double-edged sword; the downside is that as data volume grows, the referential-violation checks significantly slow down inserts and deletes. This results in the developer having to put the RI check in the application (with a dirty read) and turn off RI enforcement in the database, finally ending up in a Snowflake-like situation.
The bottom line is that Snowflake not enforcing RI should not be a limitation for OLTP applications.

Using a regular database as a data warehouse

Can anyone tell me what the implications are when attempting to use a regular database as a data warehouse?
I understand a data warehouse is known for storing data in a more structured manner; however, what is the implication of using a standard database to achieve the same result? Can we not just create regular database tables with structured data, as it would reside in a data warehouse?
Data structure is not the issue - optimization is.
OLTP databases like SQLS are optimized to reliably record transactions. They store data as records, and extensively use disk I/O.
BI databases like Redshift or Teradata are optimized to query data. They store data as columns, and often are in-memory only (no disk I/O).
As a result, traditional databases are better at getting data in, while BI databases are better at getting data out (both platforms are trying to mitigate their weaknesses, so the difference is blurring).
Practically speaking, you can use regular databases like SQLS to build a data warehouse without any problems, unless your needs are special:
Data size is large (billions of records)
Refresh rate is high (hour/minute/real time)
You intend to use live connection from BI tools like Tableau or PowerBI (as opposed to loading data extract into them)
Your queries are highly complex and computationally intensive
You can also combine both platforms. Import, process, integrate and store data in a regular database, then convert it into a star schema (dimensional model) and publish it to a BI database (e.g. keep normalized data in SQLS and publish the star schema to Redshift).
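As a minimal sketch of what that published star schema might look like (the table and column names are made up):

    -- Dimension table
    CREATE TABLE dim_customer (
        customer_key INT PRIMARY KEY,
        customer_name VARCHAR(200),
        region VARCHAR(50)
    );

    -- Fact table referencing the dimension by its surrogate key
    CREATE TABLE fact_sales (
        sale_date DATE,
        customer_key INT REFERENCES dim_customer (customer_key),
        amount DECIMAL(12, 2)
    );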
If you intend to import data extracts into BI tools like Tableau or PowerBI, then you can safely use any traditional database, because the tools rely on their internal engines, and using a BI database won't give you any advantage.
Data warehouses will also have redundant or duplicate data in them, which is not really what you are looking for in a regular database.

Hadoop and database

I am currently looking at an issue where I am trying to integrate Hadoop with a database, since Hadoop offers parallelism but not performance. I was referring to the HadoopDB paper. Hadoop usually takes a file, splits it into chunks and places these chunks on different data nodes. During processing the namenode tells the location where a chunk might be found and runs a map on that node. I am looking at the possibility of the user telling the namenode which datanode to run the map on, with the namenode then running the map to get the data either from a file or from a database. Can you kindly tell me whether it is feasible to tell the namenode which datanode to run the map on?
Thanks!
Not sure why you would like to tie a map/reduce task to a particular node. What happens if that particular node goes down? In Hadoop, map/reduce operations cannot be tied to a particular node in the cluster; that is what makes Hadoop more scalable.
Also, you might want to take a look at Apache Sqoop for importing/exporting data between Hadoop and a database.
If you are looking to query data from a distributed data store, then why don't you consider storing your data in HBase, which is a distributed database built on top of Hadoop and HDFS? It stores data in HDFS in the background and gives query semantics like a big database. In that case you don't have to worry about issuing queries to the right data node; HBase (also known as the Hadoop database) will take care of that.
For easy querying and storing of data in HBase, and if your data is time-series data, you can also consider using OpenTSDB, which is a wrapper around HBase that provides easy tag-based query semantics and integrates nicely with GNUplot to give you graph-like visualization of your data.
HBase is very well suited for random reads/writes to a very large distributed data store; however, if your queries operate on bulk reads/writes, Hive may be a well-suited solution for your case. Like HBase, it is built on top of Hadoop MapReduce and HDFS and converts each query into underlying map-reduce jobs. The best thing about Hive is that it provides SQL-like semantics and you can query it just like you would a relational database.
As far as the organization of data and a basic introduction to the features of Hive are concerned, you may like to go through the following points:
Hive adds structure to the data stored on HDFS. The schema of tables is stored in a separate metadata store. It converts SQL-like queries into multiple map-reduce jobs running on HDFS in the background.
Traditional databases follow the schema on write policy where once a schema is designed for a table, at the time of writing data itself, it is checked whether the data to be written conforms to the pre-defined schema. If it does not, the write is rejected.
In the case of Hive, it is the opposite: it uses a schema-on-read policy. Both policies have their own trade-offs. With schema on write, load time is longer and loads are slower because schema conformance is verified at load time; however, it provides faster query times because the data can be indexed based on the predefined columns in the schema. There may be cases where the indexing cannot be specified while the data is initially being populated, and this is where schema on read comes in handy: it provides the option to have two different schemas on the same underlying data, depending on the kind of analysis required.
Hive is well suited for bulk access and bulk updates of data, since an update requires a completely new table to be constructed. Also, query time is slower than in traditional databases because of the absence of indexing.
Hive stores the metadata into a relational database called the “Metastore”.
There are 2 kinds of tables in Hive:
Managed tables - where the data file for the table is predefined and is moved into the Hive warehouse directory on HDFS (in general, or any other Hadoop filesystem). When a table is dropped, both the metadata and the data are deleted from the filesystem.
External tables - here you can add data to the table lazily. No data is moved to the Hive warehouse directory in this case, and the schema/metadata is loosely coupled to the actual data. When a table is dropped, only the metadata gets deleted and the actual data is left untouched. This becomes helpful when you want the data to be used by multiple databases, or when you need multiple schemas on the same underlying data.
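A minimal HiveQL sketch of the two table types (the paths and columns are made up):

    -- Managed table: Hive owns the files under its warehouse directory
    CREATE TABLE page_views (user_id STRING, url STRING)
    PARTITIONED BY (view_date STRING)
    STORED AS ORC;

    -- External table: Hive only tracks metadata; dropping it leaves the files in place
    CREATE EXTERNAL TABLE page_views_raw (user_id STRING, url STRING)
    PARTITIONED BY (view_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/logs/page_views/';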

Resources