Snowflake database organization when you have multiple environments (dev/test/prod)?

I am considering two methods of organizing my Snowflake databases, and would like to get some opinions from the Snowflake architecture masters out there. To me, Design 1 is better due to code portability across environments. Which is better?
Design 1:
Separation of usage by schemas:
This first design allows for code portability across environments, without the need to reference database names in code between environments. Cloning also works well here (see the sketch after the list below). The downside is that schemas, not databases, are the only separation of data usage. Unfortunately, Snowflake does not support synonyms (aliases), which would otherwise let me use Design 2 with full code portability.
Data_Lake_Dev
schemas as needed for raw, base, transformed (Sales, Mkt, etc.), and dimensional marts
Data_Lake_Test
schemas as needed for raw, base, transformed, and dimensional marts
Data_Lake_Prod
schemas as needed for raw, base, transformed, and dimensional marts
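As an aside on the cloning point, a minimal Snowflake sketch using the database names above (refreshing test from prod is an assumed use case, not something from the question):

    -- zero-copy clone: schemas and tables come across without copying data up front
    CREATE OR REPLACE DATABASE Data_Lake_Test CLONE Data_Lake_Prod;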
Design 2:
Separation of usage by databases:
This second design creates databases for separation of usage, which is much clearer than Design 1, but code is not portable across environments (except when using dbt, which allows this via Jinja, creating aliases across databases; see the sketch after the list below). In the layout below, you would repeat each database for _dev, _test, and _prod environments, so 12 databases versus 3 in the first design.
Raw_db_dev
raw ingestion
Base_db_dev
light transforms
Transform_db_dev
business area usage (Sales, Marketing, etc.)
Analytics_db_dev
data marts
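On the dbt point, a hedged sketch of the Jinja approach: dbt lets you override its built-in generate_database_name macro, so database names can be suffixed per target. The Raw_db base name and the dev/test/prod target names here are assumptions for illustration:

    {# macros/generate_database_name.sql: suffix every database with the target name #}
    {% macro generate_database_name(custom_database_name, node) -%}
        {{ custom_database_name | trim }}_{{ target.name }}
    {%- endmacro %}

A model configured with database='Raw_db' would then compile to Raw_db_dev, Raw_db_test, or Raw_db_prod depending on the active target, leaving the model code itself unchanged across environments.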

David: in my opinion, code portability should not be a deciding factor at all in organizing your Snowflake databases. SnowSQL provides variables/parameterization options. Follow the link below to overcome code-migration issues across multiple databases/environments/accounts.
https://docs.snowflake.com/en/user-guide/snowsql-use.html#using-variables
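For illustration, a minimal SnowSQL sketch of that variable mechanism (the script, variable, and object names are hypothetical):

    -- deploy.sql: the target database is supplied as a variable
    USE DATABASE &{target_db};
    CREATE OR REPLACE TABLE raw.customers (id INT, name STRING);

    -- run the same script against any environment
    snowsql -o variable_substitution=true -D target_db=Data_Lake_Dev -f deploy.sql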
I'd personally prefer having 2 separate accounts - 1 for DEV/QA and another for Prod.


Which Multi-tenant approach is recommended

As per the Multi-Tenant Data Architecture post, there are three ways to implement multi-tenancy:
Separate Databases
Shared Database, Separate Schemas
Shared Database, Shared Schema
I have the following requirements:
1. Users should be able to back up and restore their data.
2. Number of tenants: 3 (approx.).
3. Each tenant might belong to a different domain (URL).
4. There are some common tables for all the tenants.
5. Number of tables per tenant: 10 (initially).
I would like to know which approach is more suitable for me?
I think option 2 is the best one, but you still have an issue with requirement 1: backup and restore is not available per schema, so you will need to handle it with a data import/export or a custom tool. The common tables would get their own schema.
In option 1, you would need to handle requirement 4 yourself, since the common tables would have to be replicated across all the databases.
The most important of the five conditions is condition 4, which says that some tables are common to all tenants. If some tables are common, then Separate Databases (option 1) is ruled out.
You could have gone ahead with option 2, shared database and separate schemas, but the number of tenants is quite small (3 in your case). In such a scenario the overhead of maintaining separate schemas should be avoided, so we can skip option 2 as well once we evaluate option 3.
Option 3, shared database with shared schema, will be the most efficient option for you. It avoids the overhead of maintaining separate schemas and allows common tables among tenants. In a shared schema, a tenant identifier is generally used across tables. Hibernate already has support for such tenant identifiers (in case you will use Java/J2EE for the implementation).
The only problem may be performance: putting the data of all three tenants together in the same table(s) will lower database access/search performance, which you will have to counter with de-normalization and indexing.
I would recommend going forward with option 3.
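To make the tenant-identifier idea concrete, a minimal sketch (the table and column names are hypothetical):

    -- shared table: every row carries the tenant it belongs to
    CREATE TABLE orders (
        order_id   INT PRIMARY KEY,
        tenant_id  INT NOT NULL,       -- discriminator column
        amount     DECIMAL(12,2)
    );
    CREATE INDEX idx_orders_tenant ON orders (tenant_id);

    -- every application query filters on the tenant
    SELECT order_id, amount FROM orders WHERE tenant_id = 3;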
For us: we developed an HR ERP that is used by about forty clients and thousands of users; the approach we used for multi-tenancy is the third one.
In addition, one of the techniques we used for separating tables was inheritance.
I had the same issue in my system. We had an ad-network system that became pretty large over time, so I was considering migrating to a multi-tenant architecture per publisher.
Option 3 wasn't relevant, as some publishers had special requirements (additional columns/procedures, and we do not support partitioning). Option 2 wasn't relevant since we had concurrency and load issues between customers. In order to reach high performance, with plans to scale out to 1,000 publishers, with 20 widgets and high concurrency, our only solution was option 1, Separate Databases.
We have already migrated to this mode and have 10 databases running, with a shared database for configuration and ads. Performance-wise it is great. For high availability it is also very good, as one publisher's traffic doesn't affect all the others. Adding a new publisher is an easy step for us, as we have a template environment. The only issue we have is maintenance.
Lately I've been reading about PartitionDB. It looks very simple to manage, as you can have a gate database to perform all maintenance work, including upgrades and top-level queries. It supports a common shared database (the same as we already have), and I am now trying to understand how to use the standalone databases as well.

How to present a database design?

I am doing a project in the university and it includes a MySQL database. I have a design for the database in terms of a list of tables and their respective fields.
In what form should I present this design? Just the list of tables and content? In an ERD? How do you present your designs?
To clarify: whatever you answer, I expect not only a specification of how you present your design, but also which tools you use to create the diagrams/lists/tables etc.
ERD is the only way to go. As they say, a picture is worth a thousand words.
But don't try to put the whole database on one diagram. It will, in all but the most trivial cases, be overwhelming to your audience to try to digest the entire database design in one go. Instead, break the diagrams into subject areas depicting only the most relevant tables in each diagram. For example, a point-of-sale system might have separate diagrams for Inventory, Sales, Accounting, Customer Management, Security, Auditing, and Reporting. Some tables will show up in more than one subject area -- this is to be expected.
As far as tooling goes, nothing beats ERwin, but it is really expensive and only available for Windows. Visio is ubiquitous in a corporate environment, but is also Windows-only and not exactly cheap either. Macs offer some really nice diagramming tools; most of them are not free.
Dia is a decent, free, and cross-platform diagramming tool. It is a bit quirky, though, and I have not had much success making the diagrams look as nice as I want them to look.
For MySQL, I have played with fabFORCE dbDesigner and it is not bad, but I did find its support for multiple subject areas to be a bit lacking at the time -- perhaps they've improved it since. But it is free and works on Windows and Linux.
For the actual presentation, I create images from these diagramming tools and pull them into presentation software (PowerPoint, KeyNote, or OpenOffice Impress). These presentations can be exported to PDF and distributed to the audience; they won't need anything more than a PDF viewer to review the information later.
Let's look at this from your professor's perspective. If I were him/her:
I would require an ERD. Without it, I cannot see one of the most fundamental aspects of a database design: how the tables are related.
I would also expect some basic use cases/requirements. What problems are you trying to solve with this database design?
I would want to see what indexes are in place, especially on the foreign key columns (a small sketch follows below). I would want to see expected row counts in all tables, to determine whether indexes are even required.
I would want to see column data types, to determine whether they meet the requirements. I would want to see which columns accept NULL values, since that can often cause problems if you're not careful.
If I were using SQL Server, I would probably create a diagram in SSMS to display a somewhat basic ERD. Visio can be used as well. I might use Visio to create my use cases, or perhaps Microsoft Word.
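As an aside on the index point above, a minimal MySQL sketch (the tables are hypothetical; note that InnoDB will create an index on a foreign key column itself if you don't declare one):

    CREATE TABLE customers (
        id INT PRIMARY KEY
    ) ENGINE=InnoDB;

    CREATE TABLE orders (
        id          INT PRIMARY KEY,
        customer_id INT NOT NULL,
        INDEX idx_orders_customer (customer_id),   -- supports joins and FK checks
        FOREIGN KEY (customer_id) REFERENCES customers (id)
    ) ENGINE=InnoDB;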
MySQL Workbench will make you pretty graphics for presentations, among many other sophisticated features.
Depends on the audience. ERD certainly isn't the only answer and may not be the best. You should choose a medium that your audience will understand.
Don't forget to discuss design aspects that can't fit into an ERD:
1) how inheritance/aggregation relationships from your analytical model are implemented in your db;
2) how you are going to support hierarchies of your objects in the relational db (if you have any);
3) relationships that are in your analytical model but are not supported by the relational design;
4) the ETL process, change tracking, schema-change tracking, and resource-based security;
5) storage partitioning and maintenance aspects (one goal being to optimize backup time);
6) in-production testing (test-island data) and easy cloning of the db for a test environment.

SQL Server architecture guidance

We are designing a new version of our existing product on a new schema.
It's an internal web application with possibly 100 concurrent users (max). This will run on a SQL Server 2008 database.
One of the recent discussion items is whether we should have a single database or split the data, for performance reasons, across two separate databases.
The database could grow anywhere from 50-100GB over 5 years.
We are Developers and not DBAs so it would be nice to get some general guidance.
[I know the answer is not simple as it depends on the schema, archiving policy, amount of data etc. ]
Option 1: Single Main Database
[This is my preferred option.]
The plan would be to have all the tables in a single database, possibly using filegroups and partitioning to separate the data across multiple disks if required [using schemas if appropriate]. This should deal with the performance concerns.
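To sketch the filegroup idea in T-SQL (the database, file, table, and path names are hypothetical):

    -- add a filegroup on a separate disk for the bulky, mostly read-only data
    ALTER DATABASE MainDb ADD FILEGROUP FG_Tracking;
    ALTER DATABASE MainDb ADD FILE
        (NAME = 'MainDb_Tracking', FILENAME = 'E:\Data\MainDb_Tracking.ndf')
        TO FILEGROUP FG_Tracking;

    -- create the large table on that filegroup
    CREATE TABLE dbo.VehicleTracking (
        TrackingId BIGINT PRIMARY KEY,
        VehicleId  INT,
        RecordedAt DATETIME
    ) ON FG_Tracking;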
One of the comments on this was that a single server instance would still be processing all the data, so there would still be a processing bottleneck.
For reporting we could have a separate reporting DB but this is still being discussed.
Option 2: Split the database into 2 separate databases
DB1 - Customers, Accounts, Customer resources etc
DB2 - This would contain the bulk of the data [i.e. Vehicle tracking data, financial transaction tables etc].
These tables would typically contain a lot of data. [It could reside on a separate server if required]
This plan would involve keeping the main data in a smaller database [DB1] and retaining the [mainly] read-only transaction-type data in a separate DB [DB2]. The UI would mainly read from DB1 and thus be more responsive.
[I'm aware that this option makes it harder for Referential Integrity to be enforced.]
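On that referential-integrity caveat, a hedged T-SQL sketch (the names are hypothetical): cross-database joins work fine, but SQL Server does not support foreign keys that reference another database, so RI would need triggers or application logic.

    -- a cross-database join is straightforward
    SELECT c.Name, t.Amount
    FROM DB1.dbo.Customers c
    JOIN DB2.dbo.Transactions t ON t.CustomerId = c.CustomerId;

    -- but this is not allowed: foreign keys cannot span databases
    -- ALTER TABLE DB2.dbo.Transactions ADD CONSTRAINT FK_Trans_Cust
    --     FOREIGN KEY (CustomerId) REFERENCES DB1.dbo.Customers (CustomerId);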
Points for consideration
As we are at the design stage, we can at least make proper use of indexes to deal with performance issues, so that's why option 1 is attractive to me; it's also more of a standard approach.
For both options we are considering implementing an archiving database.
Apologies for the long question. In summary, the question is: 1 DB or 2?
Thanks in advance,
Liam
Option 1 in my opinion is the way to go.
CPU is very unlikely to be your bottleneck with 100 concurrent users generating your workload. You could acquire a single multi-socket server with additional CPU capacity available via hot-swap technology, to offer room to grow should you need it. Depending on your availability requirements, you could also consider a clustering solution that allows you to swap in more CPU resource by forcing a failover to another node.
The performance of your disk subsystem is going to be your biggest concern. Your design decisions will be influenced by the storage solution you use, which I assume will be SAN technology.
As a minimum you will want to place your LOG files (RAID 1) and DATA files (RAID 10, or 5 depending on workload) on separate LUNs.
Depending on your table access patterns, you may wish to consider placing different filegroups on separate LUNs. Partitioning your table data could prove advantageous, but only for large tables.
50 to 100 GB and 100 users is a pretty small database by most standards today. Don't over-engineer your solution by trying to solve problems that you haven't even seen yet. Splitting it into two databases, especially on two different servers, will create a mountain of headaches that you're better off without. Concentrate your efforts on creating a useful product instead.
I agree with the other comments stating that between 50 and 100 GB is small these days. I'd also agree that you shouldn't over-engineer.
But if there is an obvious (or not so obvious) logical separation between the entities you store (like you say, one part being read-write and the other mainly read-only), I'd still split it into different dbs. At least I would design it in a way that lets me easily factor one piece out. Security would be one reason; management/backup/restore another; easier serviceability (because the design will inherently be better factored and the parts better isolated from each other); and, in SQL Server, the ability to scale out (or the lack thereof with a single database). Separating login and content databases, for example, often makes sense for bigger web applications.
And if you really want a sound design, separating your entities within a single db using different schemas and putting proper permissions on objects ends up being almost the same effort, in my eyes.
Microsoft products like SharePoint, TFS and BizTalk all use several different databases (though I do not pretend to know the reasons; it's probably just the outcome of the way they organize their teams).
Especially given that you cannot scale out a single database instance on SQL Server (clustering needs multiple instances), I'd be tempted to split it.
#John: I would never use RAID 5. It serves no purpose other than to hurt performance. I agree with the RAID 10 approach.
Putting data in another database is not going to make the slightest difference to performance. Performance is a factor of other things entirely.
A reason to create a new database is maintenance and administration: for example, if one set of data needs a different backup and recovery policy, or has higher availability requirements.

What exactly is NoSQL?

What exactly is NoSQL? Is it database systems that only work with {key:value} pairs?
As far as I know, MemCache is one such database system; am I right?
What other popular NoSQL databases are there and where exactly are they useful?
Thanks, Boda Cydo.
I don't agree with the answers I'm seeing. Although it's true that NoSQL solutions tend to break the ACID rules, not all of them were created from that approach.
I think you should first define what a SQL solution is, and then you can put the "Not Only" in front of it; that will be a more accurate definition of what a NoSQL solution is.
With this approach in mind:
SQL databases are a way to group all the data stores that are accessible using Structured Query Language as the main (and most of the time only) way to communicate with them; this means the database has to support the structures that are common to those systems, like "tables", "columns", "rows", "relationships", etc.
Now put the "Not Only" in front of that sentence and you get a definition of "NoSQL". NoSQL groups all the stores created as an attempt to solve problems that cannot fit into the table/column/row structure, or even into SQL statements. In most cases these databases will not support relationships; they abandon the well-known structures because the problems have changed since their conception.
If you have a text file, and you create an API to store/retrieve/organize this information, then you have a NoSQL database in your hands.
All of this means that there are several solutions for storing information in ways that traditional SQL systems do not allow, in order to achieve better performance, flexibility, etc. Every NoSQL provider tries to solve a different problem, and that's why you won't be able to compare two different solutions directly. For example:
djondb is a document store created to be a NoSQL enterprise solution, supporting transactions, consistency, etc., but it sacrifices some performance relative to its counterparts.
MongoDB is a document store (similar to djondb) which achieves great performance but trades away some of the ACID properties to do so.
CouchDB is another document store which approaches queries slightly differently, providing views to retrieve information without doing a full query every time.
...
As you may have noticed, I only talked about document stores. That's because I wanted to show you that three different document-store implementations take different approaches; therefore you should keep in mind the golden rule of NoSQL stores: "Use the right tool for the right job".
I'm the creator of djondb, and I did a lot of research even before starting my own NoSQL implementation, but this is a field where the concepts will keep changing the way we see information storage.
From wikipedia:
NoSQL is an umbrella term for a loosely defined class of non-relational data stores that break with a long history of relational databases and ACID guarantees. Data stores that fall under this term may not require fixed table schemas, and usually avoid join operations. The term was first popularised in early 2009.
The motivation for such an architecture was high scalability, to support sites such as Facebook, advertising.com, etc...
To quickly get a handle on NoSQL systems, see this blog post I wrote: Visual Guide to NoSQL Systems. Essentially, NoSQL systems sacrifice either consistency or availability in favor of tolerance to network partitions.
What is NoSQL?
NoSQL is an acronym for Not Only SQL. The basic qualities of NoSQL databases are that they are schemaless, distributed, and horizontally scalable on commodity hardware. NoSQL databases offer a variety of functions to solve various problems with a variety of data types, where "blob" used to be the only data type in an RDBMS for storing unstructured data.
1 Dynamic Schema
NoSQL databases allow the schema to be flexible. New columns can be added at any time; rows may or may not have values for those columns; and there is no strict enforcement of data types for columns. This flexibility is handy for developers, especially when they expect frequent changes during the course of the product life cycle.
2 Variety of Data
NoSQL databases support any type of data: structured, semi-structured, and unstructured. They allow logs, image files, videos, graphs, JPEGs, JSON, and XML to be stored and operated on as-is, without any pre-processing, which reduces the need for ETL (Extract, Transform, Load).
3 High Availability Cluster
NoSQL databases support distributed storage using commodity hardware. They also support high availability through horizontal scalability. This feature lets NoSQL databases benefit from the elastic nature of cloud infrastructure services.
4 Open Source
NoSQL databases are open-source software. The software is free to use, and most of them are free to use in commercial products. The open-source codebase can be modified to suit business needs. There are minor variations among the open-source licenses, so users must be aware of the license agreements.
5 NoSQL – Not Only SQL
NoSQL databases do not depend only on SQL to retrieve data. They provide rich API interfaces to perform DML and CRUD operations. These APIs are more developer-friendly and are supported in a variety of programming languages.
Take a look at these:
http://en.wikipedia.org/wiki/Nosql#List_of_NoSQL_open_source_projects
and this:
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
I used something called the Raima Data Manager more than a dozen years ago that qualifies as NoSQL. It calls itself a "Set Oriented Database". It's not based on tables, and there is no query "language", just a C API for asking for subsets.
It's fast and easier to work with in C/C++ than SQL: there's no building up strings to pass to a query interpreter, and the data comes back as an enumerable object rather than as an array. Variable-sized records are normal and don't waste space. I never saw the source code, but there were some hints at the interface that, internally, the code used pointers a lot.
I'm not sure the product I used is even sold anymore, but the company is still around.
MongoDB looks interesting, SourceForge is now using it.
I listened to a podcast with a team member. The idea of NoSQL isn't so much to replace SQL as to provide a solution for problems that aren't solved well with traditional RDBMSs. As mentioned elsewhere, they are faster and scale better at the cost of reliability and atomicity (different solutions to different degrees). You wouldn't want to use one for a financial system, but a document-based system would work great.
Here is a comprehensive list of NoSQL Databases: http://nosql-database.org/.
I'm glad that you have had success with RDM, John! I work at Raima, so it's great to hear feedback. For those looking for more information, here are a couple of resources:
Video Overview of RDM's General Architecture
Free Evaluation Download of RDM

Star-Schema Design [closed]

Is a Star-Schema design essential to a data warehouse? Or can you do data warehousing with another design pattern?
Using star schemas for a data warehouse system gets you several benefits and in most cases it is appropriate to use them for the top layer. You may also have an operational data store (ODS) - a normalised structure that holds 'current state' and facilitates operations such as data conformation. However there are reasonable situations where this is not desirable. I've had occasion to build systems with and without ODS layers, and had specific reasons for the choice of architecture in each case.
Without going into the subtleties of data warehouse architecture or starting a Kimball vs. Inmon flame war, the main benefits of a star schema are:
Most database management systems have facilities in the query optimiser to do 'star transformations' that use bitmap index structures or index intersection for fast predicate resolution. This means that selection from a star schema can be done without hitting the fact table (which is usually much bigger than the indexes) until the selection is resolved.
Partitioning a star schema is relatively straightforward, as only the fact table needs to be partitioned (unless you have some biblically large dimensions); a sketch follows this list. Partition elimination means that the query optimiser can ignore partitions that could not possibly participate in the query results, which saves on I/O.
Slowly changing dimensions are much easier to implement on a star schema than a snowflake.
The schema is easier to understand and tends to involve fewer joins than a snowflake or E-R schema. Your reporting team will love you for this.
Star schemas are much easier to use and (more importantly) to make perform well with ad-hoc query tools such as Business Objects or Report Builder. As a developer you have very little control over the SQL generated by these tools, so you need to give the query optimiser as much help as possible. Star schemas give the query optimiser relatively little opportunity to get it wrong.
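As flagged in the partitioning point above, a hedged T-SQL sketch (the table, function, and boundary values are hypothetical):

    -- partition only the fact table, by date
    CREATE PARTITION FUNCTION pf_SalesDate (DATE)
        AS RANGE RIGHT FOR VALUES ('2009-01-01', '2010-01-01');
    CREATE PARTITION SCHEME ps_SalesDate
        AS PARTITION pf_SalesDate ALL TO ([PRIMARY]);
    CREATE TABLE fact_transactions (
        sale_date DATE,
        amount    DECIMAL(12,2)
    ) ON ps_SalesDate (sale_date);
    -- a query filtered on sale_date now touches only the relevant partitions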
Typically your reporting layer would use star schemas unless you have a specific reason not to. If you have multiple source systems, you may want to implement an Operational Data Store with a normalised or snowflake schema to accumulate the data. This is easier because an ODS typically does not do history: a normalised or snowflaked ODS reflects 'current' state and does not hold a historical view over and above any that is inherent in the data. Historical state is tracked in the star schemas, where this is much easier to do than with normalised structures.
ODS load processes are concerned with data scrubbing and conforming, which is easier to do with a normalised structure. Once you have clean data in an ODS, dimension and fact loads can track history (changes over time) with generic or relatively simple mechanisms; this is much easier to do with a star schema, and many ETL tools provide built-in facilities for slowly changing dimensions, so implementing a generic mechanism is relatively straightforward (a sketch follows below).
Layering the system in this way provides a separation of responsibilities: business and data-cleansing logic is dealt with in the ODS, and the star schema loads deal with historical state.
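To illustrate the slowly-changing-dimension mechanism mentioned above, a hedged sketch of a Type 2 dimension (the names and values are hypothetical):

    -- each version of a customer is a separate row
    CREATE TABLE dim_customer (
        customer_key INT PRIMARY KEY,   -- surrogate key
        customer_id  INT,               -- business key
        city         VARCHAR(50),
        valid_from   DATE,
        valid_to     DATE,              -- NULL for the current version
        is_current   CHAR(1)
    );

    -- when customer 42 moves: close the old row, open a new one
    UPDATE dim_customer
        SET valid_to = '2010-06-01', is_current = 'N'
        WHERE customer_id = 42 AND is_current = 'Y';
    INSERT INTO dim_customer
        VALUES (1043, 42, 'Boston', '2010-06-01', NULL, 'Y');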
There is an ongoing debate in the data warehousing literature about where in the data warehouse architecture the star schema design should be applied.
In short, Kimball advocates very strongly for using only the star schema design in the data warehouse, while Inmon first wants to build an Enterprise Data Warehouse using normalised 3NF design and only later use the star schema design in the data marts.
In addition to these you could also say that snowflake schema design is another approach.
A fourth design could be the Data Vault Modeling approach.
Star schemas are used to enable high-speed access to large volumes of data. The high performance is enabled by reducing the number of joins needed to satisfy any query made against the subject area. This is done by allowing data redundancy in dimension tables.
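As an illustration, a minimal sketch of such a star (the names and types are hypothetical):

    CREATE TABLE dim_date    (date_key INT PRIMARY KEY, full_date DATE, month_name VARCHAR(10), year_no INT);
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, product_name VARCHAR(100), category VARCHAR(50));
    CREATE TABLE fact_sales (
        date_key    INT REFERENCES dim_date (date_key),
        product_key INT REFERENCES dim_product (product_key),
        quantity    INT,
        amount      DECIMAL(12,2)
    );

    -- any report is one join per dimension, never chains of joins
    SELECT d.year_no, p.category, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year_no, p.category;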
You have to remember that the star schema is a pattern for the top layer of the warehouse. All models also involve staging schemas at the bottom of the warehouse stack, and some also include a persistent transformed-and-merged staging area where all source systems are merged into a 3NF-modelled schema. The various subject areas sit above this.
Alternatives to star schemas at the top level include a variation, the snowflake schema. A newer method that may bear some investigation is Data Vault Modelling, proposed by Dan Linstedt.
The thing about star schemas is that they are a natural model for the kinds of things most people want to do with a data warehouse. For instance, it is easy to produce reports with different levels of granularity (month or day or year, for example). It is also efficient to insert typical business data into a star schema; again, a common and important feature of a data warehouse.
You certainly can use any kind of database you want but unless you know your business domain very well it is likely that your reports will not run as efficiently as they could if you had used a star schema.
Star schemas are a natural fit for the last layer of a data warehouse. How you get there is another question. As far as I know, there are two big camps, those of Bill Inmon and Ralph Kimball. You might want to look at the theories of these two guys if/when you decide to go with a star.
Also, some reporting tools really like the star schema setup. If you are locked into a specific reporting tool, that might drive what the reporting mart looks like in your warehouse.
Star schema is a logical data model for relational databases that fits regular data warehousing needs; if the relational environment is a given, a star or a snowflake schema will be a good design pattern, hard-wired into lots of DW design methodologies.
There are, however, non-relational database engines too, and they can be used for efficient data warehousing. Multidimensional storage engines can be very fast for OLAP tasks (e.g. TM1); we cannot apply star schema design in that case. Other examples requiring special logical models include XML databases and column-oriented databases (e.g. the experimental C-store).
It's possible to do without. However, you will make life hard for yourself -- your organization will want to use standard tools that live on top of DWs, and those tools will expect a star schema -- a lot of effort will be spent fitting a square peg in a round hole.
A lot of database-level optimizations assume that you have a star schema; you will spend a lot of time optimizing and restructuring to get the DB to do "the right thing" with your not-quite-star layout.
Make sure that the pros outweigh the cons.
(Does it sound like I've been there before?)
-D
There are three problems we need to solve.
1) How to get the data out of the operational source systems without putting undue pressure on them by joining tables within and between them, cleaning data as we extract, creating derivations, etc.
2) How to merge data from disparate sources (some legacy, some file-based, from different departments) into an integral, accurate, efficiently stored whole that models the business and does not reflect the structures of the source systems. Remember, systems change and are replaced relatively quickly, but the basic model of the business changes slowly.
3) How to structure the data to meet specific analytical and reporting requirements for particular people/departments in the business as quickly and accurately as possible.
Solving these three very different problems requires different architectural layers.
Staging Layer
We replicate the structures of the sources, but only changed data from the sources is loaded each night. Once the data is taken from the staging layer into the next layer, it is dropped. Queries are single-table queries with a simple date_time filter, with very little effect on the source.
Enterprise Layer
This is a business-oriented 3rd normal form database. Data is extracted (and afterwards dropped) from the staging layer into the enterprise layer, where it is cleaned, integrated, and normalised.
Presentation (Star Schema) Layer
Here we model dimensionally to meet specific requirements. Data is deliberately de-normalised to reduce the number of joins. Hierarchies that may occupy several tables in the Enterprise Layer are collapsed into single dimension tables, and multiple transactional tables may be merged into single fact tables. A sketch follows below.
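For instance, a hedged sketch of that hierarchy collapse (the names are hypothetical):

    -- the Enterprise Layer holds product -> subcategory -> category as three
    -- normalised tables; the Presentation Layer collapses them into one dimension
    CREATE TABLE dim_product (
        product_key      INT PRIMARY KEY,
        product_name     VARCHAR(100),
        subcategory_name VARCHAR(50),   -- flattened from the subcategory table
        category_name    VARCHAR(50)    -- flattened from the category table
    );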
You always face these three problems. If you choose to do away with the enterprise layer, you still have to solve the second problem, but you have to do it in the star schema layer, and in my view, this is the wrong place to do it.
