Constraints in Snowflake database and performance

I've read the Snowflake docs and understand that Snowflake does not enforce constraints, except for NOT NULL. However, does Snowflake use these constraints in any way to help optimize queries?

Check this Snowflake post by Kent Graziano:
RI (Referential Integrity) Constraints: 3 Reasons to Include Them in Your Data Warehouse
Basically, these constraints are useful even when the database itself doesn't use them, for reasons like:
Design Metadata (people and code trying to understand your schema)
BI Metadata (tools like Looker and Tableau can leverage this information)
Quality assurance (run checks to verify that the design is being respected)
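As a rough illustration of the quality assurance point, here is a minimal sketch of a check for duplicate values in a key that is declared but not enforced. It uses Python's built-in sqlite3 as a stand-in for the warehouse, and the table, column, and function names (customers, customer_id, duplicate_keys) are assumptions made up for the example, not anything from the post.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse. The table is created
# without a PRIMARY KEY so duplicates are accepted, mimicking a
# declared-but-unenforced key. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (2, "Bob again"), (3, "Cy")],
)


def duplicate_keys(conn, table, key):
    """Return (key value, row count) for every key value that occurs more than once."""
    # table and key are interpolated directly, so pass trusted identifiers only.
    sql = (
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY {key}"
    )
    return conn.execute(sql).fetchall()


dupes = duplicate_keys(conn, "customers", "customer_id")
print(dupes)  # [(2, 2)]
```

The same GROUP BY ... HAVING query should carry over to most SQL dialects; only the connection code is specific to this sketch.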

Related

Table Relationships - Access Front End with SQL Server Backend

When our IT department converts Access databases to SQL Server the relationships do not transfer over. In the past, I have provided ERDs that they can use to build the relationships. In this case, I didn't.
What are the possible consequences of defining the table relationships in the MS Access Front End versus on the SQL Server itself?
It would be ideal if I could just create the relationships in Access and avoid submitting a request to IT, but I don't want to risk performance issues now or in the future.
There may be some misconceptions.
A relationship in SQL Server enforces referential integrity (an order cannot have a customer ID that doesn't exist). It does not automatically create an index on the Foreign Key, so it has per se no impact on performance.
But in most cases it is a good idea to define an index on a foreign key, to improve performance.
A relationship that you define in Access on linked tables does neither. It cannot enforce referential integrity (that's the server's job).
It is merely a "hint" that the tables are related via the specified fields, e.g., so that the Query Builder can automatically join the tables if they are added to the query design.
So you should
Create the relationships in SQL Server to avoid inconsistent data. ("But my application logic prevents that!", I hear you say. Well, applications have bugs.)
Create indexes on foreign keys where appropriate to avoid performance problems.
If you are working with queries in the Access frontend, additionally define the relationships there.
Ideally you should have a test server where you can yourself define the relationships, and just send the finished SQL script to IT.
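The two server-side points (the relationship enforces integrity, but the index is a separate step) can be sketched as follows. This is only an illustration using Python's built-in sqlite3 rather than SQL Server, so the syntax details differ; the table names and the index name ix_orders_customer_id are hypothetical. SQLite, like SQL Server, does not index the referencing column automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(customer_id))"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ann')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # parent row exists

try:
    conn.execute("INSERT INTO orders VALUES (101, 999)")  # no such customer
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Declaring the foreign key did not create an index on orders.customer_id.
indexes_before = [row[1] for row in conn.execute("PRAGMA index_list(orders)")]
conn.execute("CREATE INDEX ix_orders_customer_id ON orders (customer_id)")
indexes_after = [row[1] for row in conn.execute("PRAGMA index_list(orders)")]

print(rejected, indexes_before, indexes_after)
```

On SQL Server the equivalent would be an ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY statement followed by a separate CREATE INDEX, which is the kind of script you could hand to IT.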

Column store in data warehouse

I have a question about data warehousing and column-oriented databases. In my project the company uses a warehouse solution in Visual Studio SQL Server, and they have trouble with performance when running complex queries on large amounts of data. I want to try replacing the database with a column-based database. I know that you can "transform" a row-oriented database into a more column-based one or use an open source database such as Vertica or Sybase IQ; I'm just wondering how it would fit in the warehouse. Do you have to have a star join schema in a warehouse, or can you use the columnar approach instead? I realize this is kind of a stupid question, but I'm just trying to understand it all before I start to explore the different databases and solutions.
I know that SQL Server 2012 has a column store, but I would like to try the other open source databases as well.
Thanks in advance!
Do you have to have a star join schema in a warehouse or can you use the columnar approach instead?
The star join schema consists of the table definitions of your data warehouse. The star schema, and similar schemas, trade query performance for query flexibility. Usually, query flexibility is more important than query performance in a data warehouse.
Based on the Wikipedia article you linked to in your comments, a column oriented database engine stores the actual database bytes in column order, rather than the traditional row order of relational databases.
As the article says, this can improve disk access performance.
The star schema is how you define tables. A column oriented database engine is concerned with how the database information is written to disk. The two concepts have nothing to do with one another, except that they both apply to a data warehouse.
Keep your present data warehouse schema, and see if a column oriented database engine will improve query performance.
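To make the distinction concrete, here is a toy sketch in Python (not a real storage engine) that holds the same fact-table rows in a row-oriented and a column-oriented layout. The column names and values are made up for the example. The logical schema is identical in both cases; only the physical arrangement changes.

```python
# The same three sales facts in two physical layouts (hypothetical data).
rows = [
    {"date_key": 20240101, "product_key": 7, "amount": 10.0},
    {"date_key": 20240101, "product_key": 9, "amount": 4.5},
    {"date_key": 20240102, "product_key": 7, "amount": 12.0},
]

# Column-oriented: one contiguous list per column, same logical table.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Row layout: every whole row is visited to read a single field.
total_from_rows = sum(r["amount"] for r in rows)

# Column layout: only the "amount" column is visited.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)  # 26.5 26.5
```

An aggregate over one column reads far less data in the second layout, which is where the disk access gain comes from, while the star schema on top stays exactly as it was.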

Relationships between tables from different databases

Is it possible to define relationships between tables in different databases in SQL Server 2008? And can you recommend an online tutorial for studying it? (I prefer ASP.NET, C#)
No, you can't have foreign keys between databases.
Data integrity is within a single database only. If you need transactional consistency across databases then you should use a single database. The main issue is backups/restores: you will end up with broken data after a restore because your backups are not consistent.
A recent blog article "One Database or Ten?" explains this in more detail.
Having said that, you can use triggers if you need this and are prepared to have broken data.
Yes you can, but NOT using FOREIGN KEYS:
You can use specific stored procs which check the consistency. In this case you have to make the users use only these procedures for all the CRUD operations in both DBs.
Triggers, which will check the same.
All of the above have to run within a properly isolated transaction to be sure that your "just checked" values will not be deleted a moment later.
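A minimal sketch of the procedure-based approach, assuming Python's built-in sqlite3 with a second attached database standing in for two SQL Server databases. The function insert_order plays the role of the stored procedure, and all names (crm, customers, orders, insert_order) are hypothetical. The check and the insert run inside one explicit transaction that takes the write lock first, so the parent row cannot disappear between the two steps.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("ATTACH DATABASE ':memory:' AS crm")  # a second, separate database
conn.execute("CREATE TABLE crm.customers (customer_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE main.orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL)"
)
conn.execute("INSERT INTO crm.customers VALUES (1)")


def insert_order(conn, order_id, customer_id):
    """Insert an order only if its customer exists in the other database."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before checking
    try:
        found = conn.execute(
            "SELECT 1 FROM crm.customers WHERE customer_id = ?", (customer_id,)
        ).fetchone()
        if found is None:
            raise ValueError(f"customer {customer_id} does not exist")
        conn.execute("INSERT INTO main.orders VALUES (?, ?)", (order_id, customer_id))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


insert_order(conn, 100, 1)  # accepted
error = None
try:
    insert_order(conn, 101, 999)  # rejected: no such customer
except ValueError as exc:
    error = str(exc)

order_ids = [r[0] for r in conn.execute("SELECT order_id FROM main.orders")]
print(order_ids, error)
```

This only covers inserts. Deleting or updating a customer would need a matching procedure that checks for dependent orders, which is why all CRUD operations have to go through such procedures for the check to mean anything.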

Grouping ETL Staging Tables With User Schemas?

I was thinking of putting staging tables and the stored procedures that update those tables into their own schema, such that when importing data from SomeTable to the data warehouse, I would run an Initial.StageSomeTable procedure which would insert the data into the Initial.SomeTable table. This way all the procs and tables dealing with the Initial staging are grouped together. Then I'd have a Validation schema for that stage of the ETL, etc.
This seems cleaner than trying to uniquely name all these very similar tables, since each table will have multiple instances of itself throughout the staging process.
Question: Is using a user schema to group tables/procs/views together an appropriate use of user schemas in MS SQL Server? Or are user schemas supposed to be used for security, such as grouping permissions together for objects?
This is actually a recommended practice. Take a look at the Microsoft Business Intelligence ETL Design Practices from Project REAL. You will find (download doc from the first link) that they use quite a few schemata to group and identify objects in the warehouse.
In addition to dbo and etl, they also use admin, audit, part, olap and a few more.
I think it's appropriate enough; it doesn't really matter. You could use another database if you liked, which is actually what we do.
I'm not sure why you would want a validation schema though; what are you going to do there?
Both the reasons you list (purpose/intent, security) are valid reasons to use schemas. Once you start using them, you should always specify schema when referencing an object (although I'm lazy and never specify dbo).
One trick we use is to have the same-named table in each of several schemas, combined with table partitioning (available in SQL 2005 and up). Load the data into the first schema, then when it's validated "swap" the partition into dbo, after swapping the dbo partition into a "dumpster" schema copy of the table. Net production downtime is measured in seconds, and it's all carefully wrapped in a declared transaction.

SQL Server Replication (cross-database queries & constraints)

We want to replicate data from one database to several others (on another server). Would it make sense to replicate these tables to a shared database on the other server and have our cross-database queries reference the shared database... or would it make more sense to replicate out to each individual database on the other server? Would cross database joins pose a performance hit? Would cross-database constraints work as expected?
Replicating once to a shared database would help replication performance... I'm trying to evaluate whether or not any performance hit as a result of cross-database queries or constraints would be worth it.
Edit: It looks like cross-database constraints are not possible in SQL Server? If this is true then we would have to replicate to each database.
Cross-database queries are somewhat slower than queries within the same DB. Foreign keys work within the same DB only. The usual approach is to create a separate schema in each DB (like ETL) and then replicate those tables to that schema. This approach is actually frequently used when replicating dimension tables between data marts.
When using the cross-database approach, use triggers to implement constraints, though this may be slow and complicated.
Depending on your application, you may implement foreign keys as "logical only" and run periodic "look for orphans" queries to deal with referential integrity.
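A periodic "look for orphans" query could look roughly like this. The sketch uses Python's built-in sqlite3 with made-up customers and orders tables; no FOREIGN KEY is declared, so the relationship is logical only and bad rows are accepted at insert time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No FOREIGN KEY clause: the relationship is "logical only".
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(100, 1), (101, 3), (102, 2), (103, 4)],
)

# Child rows whose parent key has no match in the parent table.
ORPHAN_SQL = """
SELECT o.order_id, o.customer_id
FROM orders AS o
LEFT JOIN customers AS c ON c.customer_id = o.customer_id
WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL
ORDER BY o.order_id
"""

orphans = conn.execute(ORPHAN_SQL).fetchall()
print(orphans)  # [(101, 3), (103, 4)]
```

What to do with the orphans (delete them, park them in a quarantine table, or attach them to an "unknown" parent row) depends on the application.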
