I'm designing a data warehouse in Azure Synapse using a SQL pool, but I'm facing some design questions.
Context: My plan is to load partitioned Parquet files into Azure Data Lake Storage (ADLS) and then create external tables in the SQL pool to query those files.
My questions are:
Is it better in terms of performance to provide the solution just with the external tables? That is, without creating internal tables or copying the data from ADLS into database storage with CTAS, BCP, or COPY.
Is it possible to perform partitioning in external tables? Is it enough to organize the Parquet files into folders named by date?
How is user concurrency affected by external tables versus internal tables? Any recommendations from experience?
Thanks for your time.
Josh
Is it better in terms of performance to provide the solution just with the external tables?
No. Internal tables are distributed columnstores with multiple levels of caching, and they typically outperform external Parquet tables. Internal tables additionally support batch-mode scanning, columnstore ordering, segment elimination, partition elimination, materialized views, and result set caching.
Is it possible to perform partitioning in external tables?
This is not currently possible in dedicated SQL pools; see Folder Partition Elimination.
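To make the idea of folder partition elimination concrete, here is a minimal Python sketch of what it means conceptually: only the date folders that overlap the requested range are read. It is not how any Synapse engine is implemented. The `sales/year=.../month=.../day=...` layout and the helper names `partition_folder`, `parse_partition`, and `prune` are assumptions made for illustration.

```python
from datetime import date


def partition_folder(day: date) -> str:
    """Build a hypothetical date-partitioned folder path (layout is an assumption)."""
    return f"sales/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"


def parse_partition(path: str) -> date:
    """Recover the partition date from the key=value segments of a folder path."""
    parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
    return date(int(parts["year"]), int(parts["month"]), int(parts["day"]))


def prune(paths, start: date, end: date):
    """Keep only the folders whose partition date falls inside [start, end]."""
    return [p for p in paths if start <= parse_partition(p) <= end]


folders = [partition_folder(date(2023, 1, d)) for d in (1, 2, 3, 4)]
selected = prune(folders, date(2023, 1, 2), date(2023, 1, 3))
print(selected)
```

A loader that copies data from ADLS into internal tables could use the same kind of filter to pick up only the folders for the dates it needs.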
How is user concurrency affected by external tables versus internal tables?
Concurrency is a matter of query performance. The faster your queries perform, the faster sessions give up their concurrency slot. So anything that improves query performance improves the effective concurrency (the number of concurrent users you can support with reasonable query runtime).
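The relationship between query runtime and effective concurrency can be sketched as simple arithmetic. The slot count and runtimes below are made-up numbers, not Synapse limits, and the function name is hypothetical; the point is only that the same number of slots completes more queries when each query finishes sooner.

```python
def sustainable_queries_per_minute(concurrency_slots: int, avg_runtime_seconds: float) -> float:
    """Queries per minute that a fixed number of slots can complete.

    Each slot finishes 60 / avg_runtime_seconds queries per minute, so the
    total is slots * 60 / runtime (a direct application of Little's law).
    """
    if avg_runtime_seconds <= 0:
        raise ValueError("avg_runtime_seconds must be positive")
    return concurrency_slots * 60.0 / avg_runtime_seconds


# Hypothetical figures: 8 slots, queries averaging 30 s versus 5 s.
slow = sustainable_queries_per_minute(concurrency_slots=8, avg_runtime_seconds=30.0)
fast = sustainable_queries_per_minute(concurrency_slots=8, avg_runtime_seconds=5.0)
print(slow, fast)
```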
Serverless SQL Pools currently have more advanced capabilities for working with data stored as Parquet or Delta in the Data Lake.
Currently, I generate data on a different datastore and replicate it to Snowflake staging; the data then moves to the data warehouse DB through ELT ingestion for analytics purposes. However, this approach arguably creates data silos in itself, since we already have three copies of the same data:
Transactional datastore DB
Replicated Snowflake staging
Snowflake data warehouse DB
From a technical architecture point of view, is it a good idea to use Snowflake as the direct datastore for a transactional application (an application that performs many CRUD operations)? That may help avoid the cost of replication and ingestion.
The main problem I see with this approach is that Snowflake does not enforce any referential integrity (primary keys, foreign keys), so within the CRUD app I have to either always use a MERGE statement or otherwise make sure I don't create duplicate records.
The other problem is that, in the cloud, the network distance between the app and Snowflake determines the performance of the transactions, and I want good, consistent performance for my CRUD operations.
Any thoughts/suggestions are much appreciated.
Snowflake, as of today, does not perform well with singleton updates and inserts, which are what transactional databases mostly see. I have seen performance degrade when singleton inserts are submitted against Snowflake.
By contrast, Snowflake is highly optimized for bulk ingestion of both structured and unstructured data and is designed for OLAP warehouses. You can still use it for a transactional workload, but expect the performance degradation described above. Also, primary keys can be defined but they are not enforced.
In my opinion, if you are faced with that challenge, one option is to use a PostgreSQL database (open source) in the cloud as your transactional database; it complements Snowflake as the OLAP database well.
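To illustrate what "defined but not enforced" means for a CRUD application, here is a small Python sketch. It uses the standard-library `sqlite3` module as a stand-in, not Snowflake, and a table created without any key constraint to mimic unenforced keys. The `customers` table and the `naive_insert` and `upsert` helpers are hypothetical names. The update-then-insert pattern plays the role that a MERGE statement would play in the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY constraint: mimics a database where keys are declared but not enforced.
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")


def naive_insert(conn, cid, name):
    """Plain insert: nothing stops a second row with the same id."""
    conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (cid, name))


def upsert(conn, cid, name):
    """MERGE-like behaviour: update the row if it exists, insert it otherwise."""
    cur = conn.execute("UPDATE customers SET name = ? WHERE id = ?", (name, cid))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (cid, name))


naive_insert(conn, 1, "Ada")
naive_insert(conn, 1, "Ada")
duplicates = conn.execute("SELECT COUNT(*) FROM customers WHERE id = 1").fetchone()[0]

conn.execute("DELETE FROM customers")
upsert(conn, 1, "Ada")
upsert(conn, 1, "Ada Lovelace")
rows = conn.execute("SELECT id, name FROM customers").fetchall()
print(duplicates, rows)
```

With the naive insert the application silently ends up with two rows for the same id, so the deduplication logic has to live in every write path.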
No. Snowflake isn't good as a transactional / OLTP database for the reasons you've mentioned. Plus, it won't perform well with many individual CRUD operations due to how they structure the data (optimised for OLAP workloads).
I just want to point out that there are benefits to keeping separate databases. First, you want to isolate your transactional database from your analytics database; otherwise analytical queries could significantly affect the performance of the application. Second, the data in the transactional database could change, and if you had to reprocess the data for whatever reason you might not be able to do so. There are many more, but I will stop here :-)
I am new to Snowflake and am exploring Snowflake on AWS. When data is stored in Snowflake, I understood that we can create and manage data in partitions, similar to what we do in Hive. Hive doesn't allow me to have partition-level user access management. Can I do that with Snowflake? If yes, how do we do it, and how is it managed in the storage layer on AWS?
With Snowflake, you have no direct access to the underlying storage; you can only use the access mechanisms that Snowflake provides. Snowflake handles the provisioning, management, and layout of your data on the underlying storage entirely transparently. So you can't "create and manage data in partitions similar to what we do in hive".
If you want to understand more about how this storage works, you can read about micro-partitioning here.
In the vast majority of cases there is no need to interfere with how Snowflake lays out your data, but functionality is available to control how the data is clustered, though Snowflake suggests that this is only ever useful on multi-terabyte tables. You can read about clustering tables here.
Snowflake does have the concept of "external tables": tables that appear in Snowflake databases as normal tables but whose data is actually held in S3 (or Azure Blob or GCP storage) that you own and manage, rather than Snowflake. These tables can be convenient to create and use, but they perform significantly worse than tables held directly in Snowflake. When data is loaded into Snowflake it might still ultimately be stored on S3, but it is compressed, converted into a columnar format, and held in micro-partitions, so it is very different in structure from the files you can see in your S3 buckets.
I am an ASP.NET MVC / SQL Server developer and I am very new to all of this, so I may be on the completely wrong path.
I learned by googling that Snowflake can put and get data from AWS S3, Google Storage, and Azure, and that Snowflake has its own databases and tables as well.
I have the following questions:
Why should one use Snowflake when you can compute your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
Can we use Snowflake as the database for a data-driven web application? If yes, could you provide a link or something to get started?
Once again, I am very new to all of this and am hoping to get ideas from you on the best way to work with it.
Thank you in advance.
Why should one use Snowflake when you can compute your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform. Like other database technologies, it provides data storage, metadata, and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports SQL queries; as mentioned above, numerous technologies support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database, though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as the database for a data-driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example, Snowflake doesn't enforce referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels at large analytic workloads that are difficult to scale and tune. If, for example, you have millions or billions of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said. To add to that, in my experience every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem of using MPP/Distributed Databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use MPP/Distributed Databases.
Snowflake and other MPP/Distributed Databases are NOT meant for OLTP workloads but instead for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/Distributed Databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs the more money they make. To make money they have to optimize performance, but not too much, otherwise it will affect their income.
This can be an in-depth topic, so I would recommend doing more research into OLTP vs. OLAP.
Enforcing referential integrity is a double-edged sword. The downside is that, as the data volume grows, the referential violation check significantly slows down inserts and deletes. This results in the developer having to put the RI check in the program (with a dirty read) and turn off RI enforcement in the database, finally ending up in a Snowflake-like situation.
Bottom line is Snowflake not enforcing RI should not be a limitation for OLTP applications.
We have a requirement wherein we want our business users to manipulate data in Snowflake through some UI (which might require creating additional reference tables etc.).
Is this good practice, given that Snowflake is meant for DW purposes and not for transactional data, and are there any performance issues in doing so?
Typical actions could be filtering data, searching for particular IDs, updating/deleting certain row(s), etc. We would also like to know whether it's cost-effective for these activities.
Yes, your point is correct: Snowflake suits a DW workload better than a transactional one.
Having said that, we must also note that Snowflake supports a lot of the features provided by traditional OLTP databases. For example: ACID properties, transactional consistency, object recovery from accidental drops, read-only database copies for reporting purposes, data encryption, role-based access control, secure views, materialized views, support for semi-structured data, and a wide variety of functions such as scalar functions, aggregation functions, window functions, table functions, and system functions, as well as external functions and custom user-defined functions (UDFs).
It also provides connectivity interfaces (drivers/connectors) to connect to a variety of different database systems, big-data ecosystems, and analytical tools.
The Snowflake engine also offers an amazing level of query performance and it has the ability to add dynamic compute power for higher concurrency and greater performance.
So if you are looking for a database for good query performance and doing operations such as querying data, filtering data and routine updates and deletes then it will serve the purpose.
But if you are planning to rely on table constraints, PK/FK relationships between tables, indexes, a lot of single-row inserts, any isolation level apart from Read Committed, transaction management through a stored procedure, etc., then it may not be a natural choice.
There is a concept of table clustering in place of indexes. Single-row inserts should be converted into bulk loads with COPY INTO commands to reduce throttling and get better performance. Primary keys and foreign keys can be created, but they are not enforced.
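One way to picture converting single-row inserts into bulk loads is a small write buffer in the application. The sketch below is only an illustration: it uses the standard-library `sqlite3` module as a stand-in target, and the `BatchWriter` class, the `events` table, and the batch size of 100 are all hypothetical. Against Snowflake, the flush step would instead stage a file and load it with COPY INTO.

```python
import sqlite3


class BatchWriter:
    """Buffers rows and writes them in bulk instead of one statement per row."""

    def __init__(self, conn, batch_size):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer = []
        self.flushes = 0  # number of bulk writes actually issued

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Stand-in for a bulk load; one round trip for the whole batch.
        self.conn.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)", self.buffer
        )
        self.conn.commit()
        self.buffer.clear()
        self.flushes += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

writer = BatchWriter(conn, batch_size=100)
for i in range(250):
    writer.add((i, f"event-{i}"))
writer.flush()  # write the final partial batch

total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(total, writer.flushes)
```

The trade-off is latency: rows sit in the buffer until a batch fills or is flushed, which is usually acceptable for analytical data but not for a transactional UI that must read its own writes immediately.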
We have a normalized SQL Server 2008 database designed using generic tables. So, instead of having a separate table for each entity (e.g. Products, Orders, OrderItems, etc), we have generic tables (Entities, Instances, Relationships, Attributes, etc).
We have decided to have a separate denormalized database for quick retrieval of data. Could you please advise me of various technologies out there to synchronize these 2 databases, assuming they have different schemas?
Cheers,
Mosh
When two databases have so radically different schemas you should be looking at techniques for data migration or replication, not synchronization. SQL Server provides two technologies for this, SSIS and Replication, or you can write your own script to do this.
Replication will take new or modified data from a source database and copy it to a target database. It provides mechanisms for scheduling, packaging and distributing changes and can handle both real-time as well as batch updates. To work it needs to add enough info in both databases to track modifications and matching rows. In your case it would be hard to identify which "Products" have changed as you would have to identify all relevant modified rows in 4 or more different tables. It can be done but it will require some effort. In any case, you would have to create views that match the target schema, as replication doesn't allow any transformation of the source data.
SSIS will pull data from one source, transform it and push it to a target. It has no built-in mechanisms for tracking changes so you will have to add fields to your tables to track changes. It is strictly a batch process that can run according to a schedule. The main benefit is that you can perform a wide variety of transformations while replication allows almost none (apart from drawing the data from a view). You could create dataflows that modify only the relevant Product field when a Product related Attribute record changes, or simply reconstitute an entire Product record and overwrite the target record.
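The "add fields to track changes" approach can be sketched outside SSIS as a watermark-driven batch job. The Python below is an illustration under assumptions, not an SSIS package: it uses the standard-library `sqlite3` module for both databases, a hypothetical generic `attributes` table with an integer `modified_at` change-tracking column, and a hypothetical denormalized `products` target. It reconstitutes the whole product record whenever any of its attributes changed.

```python
import sqlite3

# Source: generic (entity-attribute-value style) table with a change-tracking column.
src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE attributes (instance_id INTEGER, name TEXT, value TEXT, modified_at INTEGER);
INSERT INTO attributes VALUES (1, 'title', 'Widget', 10);
INSERT INTO attributes VALUES (1, 'price', '9.99', 10);
INSERT INTO attributes VALUES (2, 'title', 'Gadget', 10);
INSERT INTO attributes VALUES (2, 'price', '19.99', 10);
""")

# Target: denormalized table for quick retrieval.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, price TEXT)")


def sync(src, dst, last_watermark):
    """Copy every product touched since last_watermark; return (new watermark, count)."""
    changed = [row[0] for row in src.execute(
        "SELECT DISTINCT instance_id FROM attributes WHERE modified_at > ?",
        (last_watermark,),
    )]
    for instance_id in changed:
        # Pivot the attribute rows of one instance into a single record.
        attrs = dict(src.execute(
            "SELECT name, value FROM attributes WHERE instance_id = ?", (instance_id,)
        ))
        dst.execute(
            "INSERT OR REPLACE INTO products (id, title, price) VALUES (?, ?, ?)",
            (instance_id, attrs.get("title"), attrs.get("price")),
        )
    dst.commit()
    new_watermark = src.execute(
        "SELECT COALESCE(MAX(modified_at), ?) FROM attributes", (last_watermark,)
    ).fetchone()[0]
    return new_watermark, len(changed)


watermark, changed = sync(src, dst, 0)  # initial full load
src.execute(
    "UPDATE attributes SET value = '24.99', modified_at = 20 "
    "WHERE instance_id = 2 AND name = 'price'"
)
watermark2, changed2 = sync(src, dst, watermark)  # incremental run
products = dst.execute("SELECT id, title, price FROM products ORDER BY id").fetchall()
print(watermark2, changed2, products)
```

Overwriting the entire target record is the simpler of the two options mentioned above; updating only the affected field would need a mapping from attribute names to target columns.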
Finally, you can create your own triggers or stored procedures that will run when the data changes and copy it from one database to the other.
I should also point out that you have probably over-normalized your database. In all three cases you will pay a performance penalty when you join all the tables to reconstitute an entity, resulting in more locking than is necessary and inefficient use of indexes. You are sacrificing performance and scalability for the sake of ease of change.
Perhaps you should take a look at the Sparse Column feature of SQL Server 2008 for a way to support flexible schemas while maintaining performance and scalability.