Snowflake schema dimension - data-modeling

This is the first time I'm working on a BI project, and the Pentaho products are not yet familiar to me, so I need to know whether the following models are correct and whether I will face difficulties later when creating hierarchies on the BI Server.
Thank you.
Time dimension:
Complication dimension (every complication can have sub-complications):

Not a good idea.
Your calendar dimension table should look like this:
create table calendar (
    calendar_id int primary key,
    name text not null unique,
    date_iso date unique,
    year smallint,
    quarter smallint,
    month smallint,
    month_name text
    -- ...
);
insert into calendar values
(0, 'N/A', null, null, null, null, null),
(20130826, 'Aug 26, 2013', '2013-08-26', 2013, 3, 8, 'August');
The point of a data warehouse is to ease analysis. Making your BI analyst do three joins to get a date does not ease analysis.
calendar_id is a "smart key", i.e. not a meaningless surrogate key. Your calendar table is the only table that should use a smart key, as it greatly aids partitioning tables by date. Also note the nullable fields, which allow for the "N/A" (Not Available) date. There's no year 0, so 0 is a good "N/A" value.
Basically, you should have only one level: fact tables that join directly to dimension tables, with no snowflaked sub-dimensions.
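To make the contrast concrete, here is the kind of single-join query the flat calendar enables (a sketch; the `sales` fact table and its `amount` column are assumed names, not from the original post):

```sql
-- One join instead of three, and the smart key allows direct range
-- predicates that line up with date-based partitioning.
select c.year, c.month_name, sum(s.amount) as total
from sales s
join calendar c on c.calendar_id = s.calendar_id
where s.calendar_id between 20130101 and 20131231  -- prunes partitions
group by c.year, c.month_name;
```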

Related

Elimination of cycle between relationships in the proposed ER diagram

In making an ER schema for a simple database, I have encountered the following problem: I get a cycle in the diagram, and I don't know whether it is redundant or whether I could eliminate it somehow.
I'll describe the problem broadly:
The Visit entity records visits to London by a vehicle, with information on its arrival, departure, and total visit time.
The Vehicle entity contains information on the vehicle's place of origin, CO2 emissions, and its number plate.
The Date entity records, for each date, the day of the week it falls on and the name of the holiday (if any) in each region.
The region of the Date entity is matched to the Vehicle region; the entry_date/end_date of the Visit entity is matched to the date of the Date entity; and, finally, the number plate of the Vehicle entity is matched to the number plate of the Visit entity. In this way, the cycle that I mentioned at the beginning appears.
The ER diagram is as follows:
If there are any questions about the problem that I have not explained, please do not hesitate to ask me. I welcome suggestions for improving the ER diagram, either to remove the cycle or to simply keep it as it is if you think it is correct.
My two cents -
"Date" is really not a good name for entity or table.
First, it is too general to convey what you really refer to.
Second, it is a reserved key word in most common languages. You just cause unnecessary trouble for programming.
You use "Date" to get holiday name (for particular region) and week day, right?
My suggestion is that you only need to save holidays in this table because weekday can be figured out in most common programming language.
This "Date" table is just a lookup table to help you find out holiday, you do not need to enforce relation between "Date" and Visit
I'd also suggest you add Region table to enforce consistent naming.
Here is the DB diagram; I renamed "Date" to "Holiday".
Here is the SQL Server implementation:
create table region (
region_code varchar(100) primary key
,region_name varchar(100)
)
create table holiday (
holiday_date date not null
,region_code varchar(100) not null
,holiday_name varchar(100) not null
)
alter table holiday add primary key (holiday_date, region_code)
alter table holiday add foreign key (region_code) references region (region_code)
create table vehicle (
number_plate varchar(100) primary key
,region_code varchar(100) not null
,CO2_emission varchar(100)
)
alter table vehicle add foreign key (region_code) references region (region_code)
create table visit (
number_plate varchar(100) not null
,entry_date date not null
,end_date date
)
alter table visit add primary key (number_plate, entry_date)
alter table visit add foreign key (number_plate) references vehicle (number_plate)
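To illustrate the lookup-only role of the holiday table, a query like the following (a sketch against the tables above) finds visits whose entry date was a holiday in the vehicle's home region, with no enforced relation between visit and holiday needed:

```sql
-- Holiday is reached through the vehicle's region; it is purely a lookup.
select v.number_plate, v.entry_date, h.holiday_name
from visit v
join vehicle ve on ve.number_plate = v.number_plate
join holiday h on h.region_code = ve.region_code
              and h.holiday_date = v.entry_date;
```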
The Relate relationship between the Vehicle and Date entities is redundant here: you can still find the dates for each visit a vehicle completed through the existing relationships. If you convert the ER diagram to DB tables, this becomes clearer.
Why are entry_date and exit_date attributes of the Visit entity? They are already captured by the relationships with the Date entity. Remove these two attributes, along with number_plate, from Visit. Lastly, add a unique id to the Visit entity.

Primary Key Constraint, migration of table from local db to snowflake, recommended data types for json column?

What order can I copy data into two different tables to comply with the table constraints I created locally?
I created an example from the documentation, but was hoping to get recommendations on how to optimize the data stored by selecting the right types.
I created two tables: one is a list of names, and the second is a list of names with a date on which they did something.
create or replace table name_key (
id integer not null,
id_sub integer not null,
constraint pkey_1 primary key (id, id_sub) not enforced,
name varchar
);
create or replace table recipts (
col_a integer not null,
col_b integer not null,
constraint fkey_1 foreign key (col_a, col_b) references name_key (id, id_sub) not enforced,
recipt_date datetime,
did_stuff variant
);
insert into name_key values (0, 0, 'Geinie'), (1, 1, 'Greg'), (2, 2, 'Alex'), (3, 3, 'Willow');
insert into recipts (col_a, col_b, recipt_date) values (0, 0, current_date()), (1, 1, current_date()), (2, 2, current_date()), (3, 3, current_date());
Select * from name_key;
Select * from recipts;
Select * from name_key
join recipts on name_key.id = recipts.col_a
where id = 0 or col_b = 2;
I read https://docs.snowflake.net/manuals/user-guide/table-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure, where it recommends changing timestamps from strings to a variant. I did not include the fourth column; I left it blank for future use. Essentially it captures data in JSON format, so I made it a variant. Would it be better to rethink this table structure and flatten the variant column?
Also, I would like to change the key to AUTO_INCREMENT; is there something like this in Snowflake?
What order can I copy data into two different tables to comply with the table constraints I created locally?
You need to give more context about your constraints, but you can control the order of COPY statements. For foreign keys, you generally want to load the referenced table before the table that does the referencing.
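For example, with the two tables above, the referenced table is loaded first (the stage and file names here are assumptions for illustration):

```sql
-- 1. Load the referenced (parent) table first.
copy into name_key
from @my_stage/name_key.csv
file_format = (type = csv);

-- 2. Then load the referencing (child) table.
copy into recipts
from @my_stage/recipts.csv
file_format = (type = csv);
```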
where it recommends to change timestamps from strings to a variant.
I think you misread that documentation. It recommends extracting values from a variant column into their own separate columns (in this case a timestamp column), ESPECIALLY if those values are dates and times, numbers within strings, or arrays.
Converting a timestamp column to a variant is exactly what it recommends against.
Would it be better to rethink this table structure to flatten the variant column?
It's definitely good to think carefully about, and do performance tests on, situations where you are using semi-structured data, but without more information on your specific situation and data, it's hard to say.
Also, I would like to change the key to AUTO_INCREMENT; is there something like this in Snowflake?
Yes, Snowflake has an AUTOINCREMENT feature, although I've heard it has some issues when used with COPY INTO statements.
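A minimal sketch of what that looks like in Snowflake DDL, applied to a simplified single-key variant of the poster's name_key table (the single-column key is an assumption for illustration):

```sql
-- id is generated automatically when omitted from the insert.
create or replace table name_key (
    id integer autoincrement start 1 increment 1,
    name varchar
);

insert into name_key (name) values ('Geinie'), ('Greg');
```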

How to generate a single GUID for all rows in a batch insert within the query?

I am writing a quick-and-dirty application to load sales plan data into SQL Server (2008 FWIW, though I don't think the specific version matters).
The data set is the corporate sales plan: a few thousand rows of Units, Dollars and Price for each combination of customer, part number and month. This data is updated every few weeks, and it's important to track who changed it and what the changes were.
-- Metadata columns are suffixed with ' ##', to enable an automated
-- tool I wrote to handle repetitive tasks such as de-duplication of
-- records whose values didn't change in successive versions of the
-- forecast.
CREATE TABLE [SlsPlan].[PlanDetail]
(
[CustID] [char](15) NOT NULL,
[InvtID] [char](30) NOT NULL,
[FiscalYear] [int] NOT NULL,
[FiscalMonth] [int] NOT NULL,
[Version Number ##] [int] IDENTITY(1,1) NOT NULL,
[Units] [decimal](18, 6) NULL,
[Unit Price] [decimal](18, 6) NULL,
[Dollars] [decimal](18, 6) NULL,
[Batch GUID ##] [uniqueidentifier] NOT NULL,
[Record GUID ##] [uniqueidentifier] NOT NULL DEFAULT (NEWSEQUENTIALID()),
[Time Created ##] [datetime] NOT NULL,
[User ID ##] [varchar](64) NULL DEFAULT (ORIGINAL_LOGIN()),
CONSTRAINT [PlanByProduct_PK] PRIMARY KEY CLUSTERED
([CustID], [InvtID], [FiscalYear], [FiscalMonth], [Version Number ##])
)
To track changes, I'm using an IDENTITY column as part of the primary key, which enables multiple versions of a row with the same business key. To track who made each change, and also to enable backing out an entire bad update if someone does something completely stupid, I am inserting the Active Directory logon of the creator of that version of the record, a timestamp, and two GUIDs.
The "Batch GUID" column should be the same for all records in a batch; the "Record GUID" column is obviously unique to that particular record and is used for de-duplication only, not for any sort of query.
I would strongly prefer to generate the batch GUID inside a query rather than by writing a stored procedure that does the obvious:
DECLARE @BatchGUID UNIQUEIDENTIFIER = NEWID();
INSERT INTO MyTable
SELECT I.*, @BatchGUID
FROM InputTable I;
I figured the easy way to do this is to construct a single-row result with the timestamp, the user ID, and a call to NEWID() to create the batch GUID, then do a CROSS JOIN to append that single row to each of the rows being inserted. I tried doing this a couple of different ways, and it appears that the query execution engine executes GETDATE() only once, because a single timestamp appears in all rows (even for a 5-million-row test case). However, I get a different GUID for each row in the result set.
The below examples just focus on the query, and omit the insert logic around them.
WITH MySingleRow AS
(
    SELECT NEWID() AS [Batch GUID ##],
           ORIGINAL_LOGIN() AS [User ID ##],
           GETDATE() AS [Time Created ##]
)
SELECT N.*, R1.*
FROM util.zzIntegers N
CROSS JOIN MySingleRow R1
WHERE N.Sequence < 10000000;
In the above query, util.zzIntegers is just a table of integers from 0 to 10 million. The query takes about 10 seconds to run on my server with a cold cache, so if SQL Server were executing GETDATE() for each row of the main table, the timestamps would certainly differ at least in the milliseconds; yet all 10 million rows have the same timestamp. But I get a different GUID for each row. As I said before, the goal is to have the same GUID in each row.
I also decided to try a version with an explicit table value constructor in hopes that I would be able to fool the optimizer into doing the right thing. I also ran it against a real table rather than a relatively "synthetic" test like a single-column list of integers. The following produced the same result.
WITH AnotherSingleRow AS
(
    SELECT SingleRow.*
    FROM (
        VALUES (NEWID(), ORIGINAL_LOGIN(), GETDATE())
    ) AS SingleRow(GUID, UserID, TimeStamp)
)
SELECT R1.*, S.*
FROM SalesOrderLineItems S
CROSS JOIN AnotherSingleRow R1;
SalesOrderLineItems is a table with 6 million rows and 135 columns, to make doubly sure that the runtime was long enough for GETDATE() to increment if SQL Server were optimizing away the table value constructor and calling the function once per row.
I've been lurking here for a while, and this is my first question, so I definitely wanted to do good research and avoid criticism for just throwing a question out there. The following questions on this site deal with GUIDs but aren't directly relevant, and a half hour of searching Google with various combinations of phrases didn't turn up anything either.
Azure actually does what I want, as evidenced in the following question I turned up in my research: Guid.NewGuid() always return same Guid for all rows. However, I'm not on Azure and am not going to go there anytime soon.
Someone tried to do the same thing in SSIS (How to insert the same guid in SSIS import), but the answer there was to generate the GUID as an SSIS variable and insert it into each row. I could certainly do the equivalent in a stored procedure, but for the sake of elegance and maintainability (my colleagues have less experience with SQL Server queries than I do), I would prefer to keep the creation of the batch GUID in a query, and to simplify any stored procedures as much as possible.
BTW, my experience level is 1-2 years with SQL Server as a data analyst/SQL developer as part of 10+ years spent writing code, but for the last 20 years I've been mostly a numbers guy rather than an IT guy. Early in my career, I worked for a pioneering database vendor as one of the developers of the query optimizer, so I have a pretty good idea what a query optimizer does, but haven't had time to really dig into how SQL Server does it. So I could be completely missing something that's obvious to others.
Thank you in advance for your help.

Storing detailed data in SQL Server

I’m designing a database in which I save votes.
I’ve created a table:
CREATE TABLE [dbo].[users_votes](
    [id] [bigint] NOT NULL,
    [like_votes] [int] NOT NULL DEFAULT ((0)),
    [dislike_votes] [int] NOT NULL DEFAULT ((0)),
    [commented_votes] [int] NOT NULL DEFAULT ((0)),
    [comments_likes] [int] NOT NULL DEFAULT ((0))
);
The issue is that there is a requirement to also store the breakdown data by location.
So, for example, if the users_votes table has 1,000 like_votes for a specific id, I need to know the breakdown by location, e.g.:
United States 340
France 155
Denmark 25
Brazil 290
Australia 190
I'm getting the data from the client as a comma-delimited string, for example:
(1,2,45,67,87,112,234), plus a country code for the location (us, au, ca, etc.).
I've been thinking about a few possibilities for storing this data, but wanted to know which of these approaches (if any) is best suited.
As the number of country codes is finite, I could expand the users_votes table and add country-code columns for each metric, e.g. like_votes_us, dislike_votes_us, comment_votes_us, comment_likes_us.
In this case I would probably use dynamic SQL to insert/update the data.
Alternatively, create a new table for each column. For example, a table named like_votes with an id, an external_id (referencing the users_votes table's id), a country_code, and a count column; the data would be stored in users_votes and also in like_votes, with one record for each combination of external_id and country_code.
In this case I will need to iterate over the inserted data to determine whether each external_id/country_code combination already exists (and just increment it) or needs to be inserted.
Which approach, if any, is the optimal way to store this data so it will be easy to insert/update and also to query?
This type of table design isn't a good idea, in all honesty. One big, important part of building a good relational database is using normal form. I'm not going to explain what that is here, as there are tens of thousands of articles on the internet explaining it and its different iterations (from 1NF to 6NF, IIRC).
Anyway, you can easily do this with a few tables. I'm having to guess a lot about your setup here, but hopefully you'll be able to extrapolate what you need and adjust what doesn't fit.
Firstly, let's start with a client table:
CREATE TABLE dbo.Client (ClientID int IDENTITY(1,1),
ClientName varchar(100), --You should really split this into Title, Forename and Surname, I'm just being "lazy" here
ClientCountryID int, --Not sure if a Client is related to a country or the vote is; I've guessed the client is.
DOB date,
EmailAddress varchar(100));
GO
So, we have a simple Client Table now. Next, we want a Country Table. This is very simple:
CREATE TABLE dbo.Country (CountryID int IDENTITY(1,1),
CountryName varchar(100),
CountryCode char(2)); --For example UK for United Kingdom, FR for France, etc
GO
You might want to store additional content there, but I don't know your set up.
Now, this is where I'm really guessing. I'm assuming that your likes and dislikes are linked to something; what, I have no idea, so I'm going to have a table called "Content". However, not knowing what these likes are against, I have no context for this table, so it's going to be very basic:
CREATE TABLE dbo.Content (ContentID int IDENTITY(1,1),
ContentType int, --Guessing at types: maybe videos, comments, articles? I have no idea, to be honest
ContentParent int, --Comments are joined to a Content (just like here on SO)? I'll guess it's possible
Content nvarchar(MAX)); --because I have no idea what's going in there
--Very simple Content Type Table
CREATE TABLE dbo.ContentType (TypeID int IDENTITY(1,1),
TypeDescription varchar(100));
GO
Now, finally, we can get onto the votes that you want to store; which might look something like this:
CREATE TABLE dbo.Vote (VoteID int IDENTITY(1,1),
ClientID int,
ContentID int,
Liked bit); --1 for Liked, 0 for Disliked, NULL for N/A perhaps?
GO
Ok, now we have some tables. Now I realise I haven't given any kind of Sample data to go in here, so I'll provide a few INSERTS statements for you, so you can get the idea:
INSERT INTO dbo.Country (CountryName, CountryCode)
VALUES ('United Kingdom','GB'),
('France','FR'),
('Germany','DE');
GO
INSERT INTO dbo.Client (ClientName, ClientCountryID, DOB, EmailAddress)
VALUES ('Mr John Smith', 1, '19880106', 'Bob@gmail.com'),
('Ms Penelope Vert', 2, '19930509', 'PVert@mfn.com');
GO
INSERT INTO dbo.ContentType (TypeDescription)
VALUES ('Video'),('Article'),('Comment');
GO
INSERT INTO dbo.Content (ContentType, ContentParent, Content)
VALUES (2, NULL, 'This is my first article, hi everyone!'),
(3, 1, 'Nice! Good to see you''re finally posting!'),
(1, NULL, 'http://youtube.com');
GO
--And now some votes:
INSERT INTO dbo.Vote (ClientID, ContentID, Liked)
VALUES (1, 1, 1),
(2, 1, 1),
(2, 2, 1),
(2, 3, 0);
GO
Notice how I've put the votes in: I've not aggregated them in the table; doing so is an awful idea. Instead, store each vote individually and use a query to aggregate. You can easily do this, for example:
SELECT C.ContentID,
Cy.CountryName,
COUNT(CASE V.Liked WHEN 1 THEN 1 END) AS LikedVotes,
COUNT(CASE V.Liked WHEN 0 THEN 1 END) AS DisLikedVotes
FROM dbo.Content C
JOIN dbo.Vote V ON C.ContentID = V.ContentID
JOIN dbo.Client CV ON V.ClientID = CV.ClientID
JOIN dbo.Country Cy ON CV.ClientCountryID = Cy.CountryID
GROUP BY C.ContentID,
Cy.CountryName;
This gives you the number of liked votes per content item, and splits it by country as well. If you want to put these countries into their own columns, then I strongly suggest doing that in your presentation layer, not your SQL (as you'd have to use dynamic SQL, and (no offence) I imagine this is beyond your skills at the moment, based on your current database design choices). Excel is very good at doing this using pivot tables; if you want to keep the process in SQL Server, consider using SSRS and a matrix.
If you have any questions, please do ask.
Note: I have not created any foreign keys, constraints, default values, etc. here. These are a definite must for any good database design.
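As a sketch of what those constraints might look like for the tables above (constraint names are my own invention):

```sql
-- Primary keys on the identity columns.
ALTER TABLE dbo.Client  ADD CONSTRAINT PK_Client  PRIMARY KEY (ClientID);
ALTER TABLE dbo.Country ADD CONSTRAINT PK_Country PRIMARY KEY (CountryID);
ALTER TABLE dbo.Content ADD CONSTRAINT PK_Content PRIMARY KEY (ContentID);
ALTER TABLE dbo.Vote    ADD CONSTRAINT PK_Vote    PRIMARY KEY (VoteID);

-- Foreign keys enforcing the relationships used in the query above.
ALTER TABLE dbo.Client ADD CONSTRAINT FK_Client_Country
    FOREIGN KEY (ClientCountryID) REFERENCES dbo.Country (CountryID);
ALTER TABLE dbo.Vote ADD CONSTRAINT FK_Vote_Client
    FOREIGN KEY (ClientID) REFERENCES dbo.Client (ClientID);
ALTER TABLE dbo.Vote ADD CONSTRAINT FK_Vote_Content
    FOREIGN KEY (ContentID) REFERENCES dbo.Content (ContentID);
```

Note that once these are in place, child tables (Vote) must be dropped before their parents (Client, Content, Country).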
Clean Up script:
DROP TABLE dbo.Client;
DROP TABLE dbo.Country;
DROP TABLE dbo.Vote;
DROP TABLE dbo.Content;
DROP TABLE dbo.ContentType;
GO

Database entity model for an availability tool where there are multiple users selecting availability for multiple times across multiple weeks

I am trying to create a database entity model for a user availability application. A number of users will give their availability up to two weeks in advance, for the AM, PM or EVE (evening) of each day of the week, and I am unsure how to represent this as database tables.
My question is: how should I create the entity model? I am struggling to see how I can do this without building 52 tables, one for each week of the year, every year. Am I missing something extremely important/simple?
Any help would be hugely appreciated.
Something like this:
create table UserAvailability (
Id int Identity(1, 1) not null, -- primary key
UserId int not null, -- FK to User table
AvailabilityDay Date not null, -- Day of availability
StartTime DateTime not null,
EndTime DateTime not null
)
It would also have a unique key on UserId + AvailabilityDay, unless you have multiple availability periods per user per day.
Cheers -
EDIT
After additional data:
create table UserAvailability (
Id int Identity(1, 1) not null, -- primary key
UserId int not null, -- FK to User table
AvailabilityDay Date not null, -- Day of availability
Morning bit not null default(0),
Afternoon bit not null default(0),
Evening bit not null default(0)
)
Note that this is actually in 2nd normal form, not third. Reporting requirements (and the rest of the DB) dictate whether it really should be in third.
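With the revised table, a query such as this (a sketch; the date is an arbitrary example) lists who is free on a given evening:

```sql
select UserId
from UserAvailability
where AvailabilityDay = '2019-07-01'
  and Evening = 1;
```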
