How to organize data marts in a data warehouse - data-modeling

I am standing up a new enterprise data warehouse for my company, using Kimball methodology (i.e., a collection of data marts). I'd like to know the best practices (or usual practices) for organizing my data marts.
1) Should each data mart be a separate database on the EDW server? Or, should each data mart be a schema of a single database?
2) For conformed dimensions (i.e., dimensions that apply to 2+ data marts / subject areas / business processes), should they live in a separate schema or database? Or, because we won't know in advance what dimensions will be conformed (since we are building a data mart at a time), should we simply identify the conformed dimensions in our enterprise bus matrix (Excel file) and make no effort to segregate them in the EDW?
3)
a) Should fact tables and dimension tables be identified at all in the EDW? For example, since I will be maintaining a diagram of each star schema that will be shared with self-service BI users, is there any value in identifying fact tables in the DB via some method, say prefixing the table name with 'Fact'?
b) If fact and dimension tables should be identified in the EDW, what should be the identification mechanism? Should it be via table name prefixing? Should it be via organizing the tables into separate 'Fact' and 'Dimension' schemas?

1) Should each data mart be a separate database on the EDW server? Or,
should each data mart be a schema of a single database?
This also depends on which database software you are using and whether it imposes any limitations on, for example, using data across multiple schemas.
In any case, you'll inevitably need to connect to, and query/join data from, different data marts, to address some business cases or even ETL processes. You may also need to segregate/secure access to specific data marts, load each data mart independently or using different schedules/methods, etc.
For these reasons, it is usually good enough to keep the data warehouse in one database organized into schemas: one schema per data mart plus specific schemas for shared objects (like conformed dimensions). This way you can still use data that is scattered across multiple data marts, easily control access to specific schemas / data marts, and it'll be easier for users to locate specific metrics/facts.
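As a rough sketch of the layout described above (one database, one schema per data mart, plus a shared schema), the following uses Python's built-in sqlite3 module. SQLite has no CREATE SCHEMA, so attached in-memory databases stand in for schemas here; that substitution, and all schema, table, and column names, are assumptions made for illustration only and are not taken from the question.

```python
import sqlite3

# Assumption: SQLite has no CREATE SCHEMA, so attached in-memory databases
# stand in for schemas to exercise two-part names (schema.table).
conn = sqlite3.connect(":memory:")
for schema in ("shared", "sales", "inventory"):  # hypothetical schema names
    conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")

conn.executescript("""
    -- Conformed dimension lives in the shared schema.
    CREATE TABLE shared.D_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    -- Each data mart keeps its own fact tables in its own schema.
    CREATE TABLE sales.F_order_line (product_key INTEGER, quantity INTEGER);
    CREATE TABLE inventory.F_stock_snapshot (product_key INTEGER, on_hand INTEGER);

    INSERT INTO shared.D_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO sales.F_order_line VALUES (1, 3), (1, 2), (2, 7);
    INSERT INTO inventory.F_stock_snapshot VALUES (1, 40), (2, 15);
""")

# Query across two data marts through the shared dimension. The sales fact is
# aggregated to product grain before joining so rows are not multiplied.
query = """
    SELECT p.product_name, s.units_sold, i.on_hand
    FROM shared.D_product AS p
    JOIN (SELECT product_key, SUM(quantity) AS units_sold
          FROM sales.F_order_line
          GROUP BY product_key) AS s
      ON s.product_key = p.product_key
    JOIN inventory.F_stock_snapshot AS i
      ON i.product_key = p.product_key
    ORDER BY p.product_key
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On a server database that supports real schemas, the same two-part names would normally sit inside a single database, and access could then be granted per schema.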
2) For conformed dimensions (i.e., dimensions that apply to 2+ data
marts / subject areas / business processes), should they live in a
separate schema or database? Or, because we won't know in advance what
dimensions will be conformed (since we are building a data mart at a
time), should we simply identify the conformed dimensions in our
enterprise bus matrix (Excel file) and make no effort to segregate
them in the EDW?
If you organize data marts into schemas, it makes sense to have a specific schema to hold these conformed dimensions and other shared data. This way, different users that may have access only to specific data marts can still use the conformed/shared dimensions.
3)
a) Should fact tables and dimension tables be identified at all in the
EDW? For example, since I will be maintaining a diagram of each star
schema that will be shared with self-service BI users, is there any
value in identifying fact tables in the DB via some method, say
prefixing the table name with 'Fact'?
Yes. Using prefixes makes it easier to locate metrics (facts) and dimensions when users are browsing the data warehouse. Something like F_tableName or D_tableName would already go a long way.
b) If fact and dimension tables should be identified in the EDW, what
should be the identification mechanism? Should it be via table name
prefixing? Should it be via organizing the tables into separate 'Fact'
and 'Dimension' schemas?
Same as above: table name prefixes such as F_ and D_ are enough :)
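To illustrate why a prefix convention helps when browsing, here is a minimal sketch that groups table names by the F_ / D_ prefixes suggested above. The function name and the sample table names are hypothetical; in practice the list of names would come from the database catalog.

```python
def classify_table(name: str) -> str:
    """Classify a table by its naming prefix (F_ = fact, D_ = dimension)."""
    prefix = name[:2].upper()
    if prefix == "F_":
        return "fact"
    if prefix == "D_":
        return "dimension"
    return "other"


# Hypothetical table names, as they might be listed from a catalog view.
tables = ["F_order_line", "D_product", "D_date", "stg_orders"]

grouped = {}
for table in tables:
    grouped.setdefault(classify_table(table), []).append(table)

print(grouped)
```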

Related

What is the difference between a schema & a database in Snowflake?

Is there a good reason to start a new project in a fresh Snowflake schema vs. a fresh Snowflake database?
I know this sounds like an opinion based question, but I'm trying to get to the technical limitations of one vs. the other.
As far as I can tell, databases & schemas are just like folders and sub-folders. They seem to have no bearing on cost or capability.
I can do:
SELECT *
FROM database1.schemaA.tableX x
JOIN database2.schemaB.tableY y ON y.row_id = x.row_id
So is it all purely syntax and table organization? Or am I missing something?
For simple use cases, you can treat databases and schemas as folders and subfolders. How you set them up is determined by how you want to organise your data and how you want to manage access control.
Access control: the more granular you want to make your access control, the more complicated it is to implement and maintain. It's relatively simple to give users access to everything in a database. It's more complicated to give users access to specific schemas within a database, and it can get very complicated to give users access to a subset of tables within a schema. Therefore, if you have sets of tables that should be accessible to different sets of users, it is easier to keep them in different schemas (or databases).
Replication: if you are going to need to replicate data to another Snowflake account (presumably in another region, since otherwise you would probably use Sharing rather than Replication), then bear in mind that replication happens at the database level, i.e. you can't replicate specific schemas (or tables or views); the whole database gets replicated. This may influence how you segregate your data between databases.
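As a small sketch of the schema-level access control mentioned above, the helper below builds the Snowflake GRANT statements that give one role read access to one schema. The helper name and the database, schema, and role names are hypothetical; the statements are only assembled as strings here, not executed against Snowflake.

```python
def schema_read_grants(database: str, schema: str, role: str) -> list[str]:
    """Build GRANT statements giving a role read access to a single schema."""
    qualified = f"{database}.{schema}"
    return [
        # The role needs USAGE on the containing database and on the schema...
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role};",
        f"GRANT USAGE ON SCHEMA {qualified} TO ROLE {role};",
        # ...and SELECT on the tables that already exist in that schema.
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {qualified} TO ROLE {role};",
    ]


# Hypothetical names: an EDW database with a SALES schema and an analyst role.
statements = schema_read_grants("EDW", "SALES", "SALES_ANALYST")
for statement in statements:
    print(statement)
```

Three statements per role and schema is manageable. Doing the same for a hand-picked subset of tables means one grant per table, which is where the maintenance cost grows.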

Data modeling in Snowflake

We perform ELT in our company. We load the data into the landing zone (which is a database in Snowflake) and have one schema per source from which the data is retrieved, such as:
LZ (database) -- FACEBOOK, LINKEDIN (schemas)
(Here nothing needs to be changed)
Once all the data is loaded, analysts create views/tasks to do the transformations needed for the information they require.
We are moving towards domain-driven design in Snowflake at a later stage. We have analysts, and each analyst belongs to a domain such as sales or vendor.
We have identified all the domains; the next step is implementation. There are two ways:
domains as databases
domains as schemas inside a single database
We can have a sales database, a vendor database.
Or we can have a database such as analysts: inside which sales could be a schema and vendor could be a schema.
Which one should I go for, and why? In most cases I have seen, it's schemas only, but I am looking for what would work best, why, and what the implications would be.

Best options to manage large sets of data in SQL Server

I am currently working on a project which involves the following:
The application I am working on is connected to a SQL Server database.
SAP loads information into multiple tables (on a daily and also an hourly basis) in a MASTER database.
There are 5 other databases (hosted on the same server) that access this information via synonyms and stored procedure calls to the MASTER database.
(The MASTER database is used purely for storing the data and routing it to the other databases.)
Master Database -
Tables:
MASTER_TABLE1 <------- SAP inserts data into this table. Triggers are used to process the valid data and insert it into secondary staging tables, say MASTER_TABLE1_SEC
MASTER_TABLE1_SEC -- Holds processed data coming into MASTER_TABLE1
Five other databases (one for each manufacturing facility) are present on the same server. My application is connected to the facility databases (not the MASTER database).
FACILITY1
Facility2
....
FACILITY5
Synonyms of MASTER_TABLE1_SEC are created in each of these 5 facility databases
Stored procedures are then called from the facility databases in order to load data from MASTER_TABLE1_SEC into the respective tables (within each facility) based on the business logic.
Is there a better architecture to handle this kind of a project? I am a beginner when it comes to advanced data management. Can anyone suggest a better architecture or tools to handle this?
There are a lot of patterns that would meet the needs described here. It seems that you are working with a type of data warehouse. I use Data Vault for my enterprise data warehouses. It is an ensemble modeling technique designed for integration and master data preparation. You can think of it as a way to house all data from all time. You would then generate data marts (Kimball method) for each of the facilities, containing only their data or whatever is required for their needs.

What are the implications of creating tables in a database with different schemas?

I am creating a database with about 40 different tables.
I have heard about people grouping tables into database 'schemas' - what are the implications of using different schemas in a database? Can tables from one schema still relate to another schema? What are the functional differences between different schemas?
Where are schemas located in SSMS? They are rightfully placed under the security tab.
Let's use the AdventureWorks database.
If you assign security at the schema level, purchasing users will only have access to the purchasing tables and salespeople will only have access to the sales tables.
In fact, they will not even see the other tables if you set it up correctly.
If you combine schemas with creating tables/indexes on filegroups, you can place all the sales tables on a sales filegroup and all the purchasing tables on a purchasing filegroup, thereby spreading the I/O load.
In short, I think schemas are an unknown and little used feature.
Check out my blog article on this fact.
http://craftydba.com/?p=4326
I assume that you are talking about SQL Server. You can join and reference between tables in different schemas. I see it mostly used for visual organization and/or for managing objects' permission (you can assign permissions at the schema-level).
If you are worried about any negative effects of doing dbo.table vs custom.table - there aren't any that I imagine you would encounter.
Schemas are just collections of database objects. They are useful for maintaining separation of sets of objects.
There is always at least one schema. For SQL Server it is named dbo.
One implication of having multiple schemas is that you will have to manage permissions for the various schemas. This is usually done via a role that's associated with the schema.
Objects in one schema are available to objects from another, and there is no performance penalty in writing queries that use objects from multiple schemas.

Structuring the data warehouse?

I need some advice on creating my DW. I have some experience in creating a DW (using Pentaho as the BI server): I made a set of scheduled database queries which create 4 dimension tables and 1 fact table (for sales reports).
Now there is a need to expand the DW (for import, warehouse, and logistics reports), so my question is: do I have to make more fact tables for each department (and dimension tables for those), or is there some other structuring model?
Of course, this time it will be done using an ETL tool, but I need general advice.
Thanks,
Stevan
A fact table represents a process or events that you want to analyze.
If there is not a new process or event that you want to analyze, then you do not need a new fact table.
Perhaps you could give us some more details about what you are analyzing...
