I'm looking at the datekey column from the fact tables in AdventureWorksDW and they're all of type int.
Is there a reason for this and not of type date?
I understand that creating a clustered index composed of an INT would optimize query speed. But let's say I want to get data from this past week. I can subtract 6 from date 20170704 and I'll get 20170698 which is not a valid date. So I have to cast everything to date, subtract, and then cast as int.
Right now I have a foreign key constraint to make sure that something besides 'YYYYMMDD' isn't inserted. It wouldn't be necessary with a Date type. Just now, I wanted to get some data between 6/28 and 7/4. I can't just subtract six from 20170704; I have to cast from int to date.
It seems like a lot of hassle and not many benefits.
Thanks.
Yes, you could be using a Date data type and have that as your primary key in the Fact and the dimension and you're going to save yourself a byte in the process.
And then you're going to have to deal with a sale that is recorded and we didn't know the date. What then? In a "normal" dimensional model, you define Unknown surrogate values so that people know there is data and it might be useful but it's incomplete. A common convention is to make it zero or in the negative realm. Easy to do with integers.
Dates are a little weird in that we typically use smart keys - yyyymmdd. From a debugging perspective, it's easy to quickly identify what the date is without having to look up against your dimension.
You can't make an invalid date. Soooo what then? Everyone "knows" that 1899-12-31 is the "fake" date (or whatever tickles your fancy) and that's all well and good until someone fat fingers a date and magically hits your sentinel date and now you've got valid unknowns mixed with merely bad data entry.
If you're doing date calculations against a smart key, you're doing it wrong. You need to go to your date dimension to properly resolve the value and use methods that are aware of date logic because it's ugly and nasty beyond just simple things like month lengths and leap year calculations.
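To make the "use date-aware methods" point concrete, here is a minimal Python sketch. The helper names (key_to_date, date_to_key, shift_key) are made up for this illustration; the idea is just to decode the smart key into a real date, do the arithmetic there, and encode it back.

```python
from datetime import date, timedelta


def key_to_date(date_key: int) -> date:
    """Decode a YYYYMMDD smart key; raises ValueError for keys like 20170698."""
    year, rest = divmod(date_key, 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


def date_to_key(d: date) -> int:
    """Encode a date as a YYYYMMDD integer."""
    return d.year * 10000 + d.month * 100 + d.day


def shift_key(date_key: int, days: int) -> int:
    """Move a smart key by a number of days using real calendar arithmetic."""
    return date_to_key(key_to_date(date_key) + timedelta(days=days))


print(shift_key(20170704, -6))  # 20170628, not 20170698
```

Plain integer subtraction only works while you stay inside one month; the date type takes care of month lengths and leap years.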
Actually, that fact table has a relationship to the DimDate table, and if you join to it you get many more options for point-in-time searches than you would get by adding and subtracting days or months.
Say you need a list of all orders on the second Saturday of May? Or all orders in the last week of December?
Also, some businesses define their fiscal year differently. Some start in June, some start in January.
In summary, DimDate is there to give you flexibility when you need to do complicated date searches without doing any calculations, using a simple index seek on DimDate.
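As a rough sketch of the "second Saturday of May" case, here is a tiny date dimension built in SQLite from Python. The column names (MonthNumber, DayName, WeekdayOccurrence) and the FactOrders table are invented for the example and are not the actual AdventureWorksDW columns; the point is only that the search becomes a plain filter on precomputed attributes.

```python
import sqlite3
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimDate (
        DateKey           INTEGER PRIMARY KEY,  -- YYYYMMDD smart key
        MonthNumber       INTEGER NOT NULL,
        DayName           TEXT    NOT NULL,
        WeekdayOccurrence INTEGER NOT NULL      -- 2 = second such weekday in the month
    );
    CREATE TABLE FactOrders (
        OrderId      INTEGER PRIMARY KEY,
        OrderDateKey INTEGER NOT NULL REFERENCES DimDate (DateKey)
    );
""")

# Populate the dimension for May 2017 only (enough for the demo).
day = date(2017, 5, 1)
while day.month == 5:
    conn.execute(
        "INSERT INTO DimDate VALUES (?, ?, ?, ?)",
        (day.year * 10000 + day.month * 100 + day.day,
         day.month,
         DAY_NAMES[day.weekday()],
         (day.day - 1) // 7 + 1),
    )
    day += timedelta(days=1)

conn.executemany(
    "INSERT INTO FactOrders VALUES (?, ?)",
    [(1, 20170506), (2, 20170513), (3, 20170513), (4, 20170515)],
)

# No date arithmetic in the query: just a join and a filter on the dimension.
second_saturday_orders = [row[0] for row in conn.execute("""
    SELECT f.OrderId
    FROM FactOrders AS f
    JOIN DimDate AS d ON d.DateKey = f.OrderDateKey
    WHERE d.MonthNumber = 5
      AND d.DayName = 'Saturday'
      AND d.WeekdayOccurrence = 2
    ORDER BY f.OrderId
""")]
print(second_saturday_orders)  # [2, 3]
```

Fiscal periods would work the same way: one more precomputed column on the dimension, no change to the fact table.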
It's a good question, but the answer depends on what kind of data warehouse you're aiming for. SSAS, for instance, covers tabular and multi-dimensional.
In multi-dimensional, you would never be querying the fact table itself through SQL, so the problem you note with e.g. subtracting 6 days from 20170704 would actually never arise, because in MD SSAS you'd use MDX on the dimension itself to implement date logic (as suggested in #S4V1N's answer above), along the lines of Calendar.Date.CurrentMember.Lag(6). And for more complicated stuff, you can build all kinds of date hierarchies and get into MDX ParallelPeriod and FirstChild and that kind of thing.
For a data warehouse that you're intending to use with SQL, your question has more urgency. I think that in that case #S4V1N's answer still applies: restrict your date logic to the dimension side,
because that's where it's already implemented (possibly with pre-built calendar and fiscal hierarchies), and
because your logic will operate on an order of magnitude fewer rows.
I'm perfectly happy to have fact tables keyed on an INT-style date: but that's because I use MD SSAS. It could be that AdventureWorksDW was originally built with MD SSAS in mind (where whether the key used in fact tables is amenable to SQL is irrelevant), even though MS's emphasis seems to have switched to Tabular SSAS recently. Or the use of INTs for date keys could have been a "developer-nudging" design decision, meant to discourage date operations on the fact tables themselves, as opposed to on the Date dimension.
The thread is pretty old, but my two cents.
At one of the clients I worked at, the design chosen was an int column. The reason given (by someone before I joined) was that there were imports from different sources - some that included time information and some that only provided the date information (both strings, to begin with).
By having an int key, we could then retain the date/datetime information in a datetime column in the Fact table, while at the same time, have a second column with just the date portion (Data type: date/datetime) and use this to join to Dim table. This way the (a) aggregations/measures would be less involved (b) we wouldn't prematurely discard time information, which may be of value at some point and (c) at that point, if required the Date dimension could be refactored to include time OR a new DateTime dimension could be created.
That said, this was the accepted trade-off there, but might not be a universal recommendation.
This is now a very old thread, but:
For non-date columns a sequential integer key is considered best practice, because it is fast and reasonably small. A natural key that encapsulates business logic could change over time, and may also need some way of identifying which version of the dimension row it refers to in a slowly changing dimension.
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/dimension-surrogate-key/
Ideally, for consistency, a date dimension should also have a sequential integer key, so why is it different? After all, the debugging argument could also be applied to other (non-date) dimensions. The Data Warehouse Toolkit, 3rd Edition, Kimball & Ross, page 49 (Calendar Date Dimension) has this comment:
To facilitate partitioning, the primary key of a date dimension can be
more meaningful, such as an integer representing YYYYMMDD, instead of
a sequentially-assigned surrogate key.
I think this refers to partitioning of a fact table, though. I would argue that the datekey is an integer for consistency with other dimensions, but not a sequential key, to allow for easier table partitioning.
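A small Python sketch of why a YYYYMMDD key makes partitioning easier (the function name month_key_range is made up for this illustration): the keys sort chronologically, so a month is simply a contiguous integer range, with no lookup against the dimension needed to find the boundaries.

```python
def month_key_range(year: int, month: int) -> tuple[int, int]:
    """Half-open [lower, upper) range of YYYYMMDD keys covering one calendar month."""
    lower = year * 10000 + month * 100
    return lower, lower + 100


fact_date_keys = [20170630, 20170701, 20170715, 20170731, 20170801]

lower, upper = month_key_range(2017, 7)
july_keys = [k for k in fact_date_keys if lower <= k < upper]
print(lower, upper, july_keys)  # 20170700 20170800 [20170701, 20170715, 20170731]
```

With a sequentially assigned surrogate key, the same boundaries would have to be looked up in the date dimension first.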
I'm trying to create a schema that will allow me to define times when a supplier website is non-operational (planned, not unplanned).
I've gone for non-operational as opposed to operational because many suppliers work 24/7, so non-operating times represent the least number of rows.
For example, a supplier might not work:
On a Sunday
On a recognised holiday date - '1/1/2015'
On a Saturday after 5pm
I'm not overly confident with SQL Server, but have come up with a schema that 'does the job'. However, as we all know, there are good ways, not so good ways, and bad ways, that all work in a fashion, so would appreciate comments and advice on what I have to date.
One of the key features is to use data from WorkingDays and Holidays together to represent a WorkingPeriod entity.
I would appreciate comments no matter how small.
Holiday
Contains all recognised holidays - Easter Monday, Good Friday etc.
HolidayDate
Contains dates of holidays. For instance, this year Easter Monday is 6th Apr 2015.
WorkingDay
Sunday through to Saturday, mapped to Asp.Net day of week enums.
WorkingPeriodType
A lookup table containing 2 rows - Holiday, or Day of Week
WorkingPeriod
Merges the Holiday table and the WorkingDay table to represent a single WorkingPeriod entity that can be used in the SupplierNonWorkingTimes table.
SupplierNonWorkingTimes
Contains the ID representing the WorkingDay/Holiday and the times of non-operation.
This is a very subjective question, as you've already observed there's no right and wrong, just different ways. I'm a database guy but I don't know your specific circumstances, so this is just some observations - you'll have to judge for yourself whether any of them are appropriate to you.
I like my naming to be crystal clear, it saves all the
misunderstanding by other people later on. If [WorkingDay] holds the
7 days of the week I would call it [WeekDay]. If you intend
[Holiday] to hold whole-day holidays I would call it [HolidayDay].
The main table [SupplierNonWorkingTime] is about 'non-working' so I
would call the [WorkingPeriod] table [NonWorkingPeriod]. The term
'period' always refers to a whole day, so I would replace 'period'
with 'day' (let's ignore start/stop time for now).
My first impression was that your design is over-normalised. The
[WorkingPeriodType] table has 2 rows that will never change,
[WorkingDay] has 7. For these very low numbers I sometimes prefer a
char(1) with a check constraint. Normalisation is generally good,
but lots of JOINs for trivial queries is not so good. You could
eliminate [WorkingPeriodType] and [WorkingDay] but you've mentioned
.Net enums in your question so if you've got some sort of ORM in
your .Net code this level of normalisation might be right for you.
I'd add a Year field to the [HolidayDate] table, then the PK
becomes a better HolidayID+Year - unless you know somewhere that has
lots of Christmases :)
I'd add an IsAllDay field to the [SupplierNonWorkingTime] table,
otherwise you have to use 'magic values' to represent 'all day' and
magic values are bad. There should be a check constraint to enforce
start/stop times can only be entered if IsAllDay = false.
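Here is one way the IsAllDay idea and its check constraint could look, sketched with SQLite from Python. The column names (SupplierId, WorkingPeriodId, StartTime, StopTime) and the 'HH:MM' text times are assumptions for the example, not taken from the question's schema; in SQL Server the same CHECK would be written against bit and time columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SupplierNonWorkingTime (
        SupplierId      INTEGER NOT NULL,
        WorkingPeriodId INTEGER NOT NULL,
        IsAllDay        INTEGER NOT NULL CHECK (IsAllDay IN (0, 1)),
        StartTime       TEXT,   -- 'HH:MM', only when IsAllDay = 0
        StopTime        TEXT,   -- 'HH:MM', only when IsAllDay = 0
        CHECK (
            (IsAllDay = 1 AND StartTime IS NULL AND StopTime IS NULL)
            OR
            (IsAllDay = 0 AND StartTime IS NOT NULL AND StopTime IS NOT NULL
                          AND StartTime < StopTime)
        )
    )
""")


def try_insert(row: tuple) -> bool:
    """Return True if the row satisfies the constraints, False if it is rejected."""
    try:
        conn.execute("INSERT INTO SupplierNonWorkingTime VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert((1, 7, 1, None, None)))        # all of Sunday: accepted
print(try_insert((1, 6, 0, "17:00", "23:59")))  # Saturday after 5pm: accepted
print(try_insert((1, 7, 1, "09:00", "10:00")))  # all day plus times: rejected
print(try_insert((1, 6, 0, None, None)))        # part day without times: rejected
```

No magic values like 00:00 to 23:59 are needed to mean "all day", and the database itself rejects contradictory rows.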
Like I said, just my thoughts, hope it's helpful.
I have a SQL Server 2008 database with a snowflake-style schema, so lots of different lookup tables, like Language, Countries, States, Status, etc. All these lookup tables have almost identical structures: two columns, Code and Decode. My project manager would like all of these different tables to be one BIG table, so I would need another column, say CodeCategory, and my primary key columns for this big table would be CodeCategory and Code. The problem is that for any of the tables that have the actual code (say Language Code), I cannot establish a foreign key relationship into this big decode table, as the CodeCategory would not be in the fact table, just the code. And codes by themselves will not be unique (they will be within a CodeCategory), so I cannot make an FK from just the fact table code field into the Big lookup table Code field.
So am I missing something, or is this impossible to do and still be able to do FKs in the related tables? I wish I could do this: have a FK where one of the columns I was matching to in the lookup table would match to a string constant. Like this (I know this is impossible but it gives you an idea what I want to do):
ALTER TABLE [dbo].[Users] WITH CHECK ADD CONSTRAINT [FK_User_AppCodes]
FOREIGN KEY('Language', [LanguageCode])
REFERENCES [dbo].[AppCodes] ([AppCodeCategory], [AppCode])
The above does not work, but if it did I would have the FK I need. Where I have the string 'Language', is there any way in T-SQL to substitute the table name from code instead?
I absolutely need the FKs so, if nothing like this is possible, then I will have to stick with my many little lookup tables. Any assistance would be appreciated.
Brian
It is not impossible to accomplish this, but it is impossible to accomplish this and not hurt the system on several levels.
While a single lookup table (as has been pointed out already) is a truly horrible idea, I will say that this pattern does not require a single field PK or that it be auto-generated. It requires a composite PK comprised of ([AppCodeCategory], [AppCode]) and then BOTH fields need to be present in the fact table that would have a composite FK of both fields back to the PK. Again, this is not an endorsement of this particular end-goal, just a technical note that it is possible to have composite PKs and FKs in other, more appropriate scenarios.
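Purely to illustrate the composite PK/FK mechanics described above (again, not an endorsement of the single-table design), here is a sketch using SQLite from Python. The LanguageCategory column, its default, and its CHECK are assumptions added for the example; the point is that the category has to physically exist in the referencing table for the FK to work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on
conn.executescript("""
    CREATE TABLE AppCodes (
        AppCodeCategory TEXT NOT NULL,
        AppCode         TEXT NOT NULL,
        Decode          TEXT NOT NULL,
        PRIMARY KEY (AppCodeCategory, AppCode)
    );
    CREATE TABLE Users (
        UserId           INTEGER PRIMARY KEY,
        -- Hypothetical extra column: the constant the question wanted to put in the FK.
        LanguageCategory TEXT NOT NULL DEFAULT 'Language'
                         CHECK (LanguageCategory = 'Language'),
        LanguageCode     TEXT NOT NULL,
        FOREIGN KEY (LanguageCategory, LanguageCode)
            REFERENCES AppCodes (AppCodeCategory, AppCode)
    );
    INSERT INTO AppCodes VALUES
        ('Language', 'EN', 'English'),
        ('Country',  'US', 'United States'),
        ('Country',  'CA', 'Canada');
""")


def add_user(user_id: int, language_code: str) -> bool:
    """Return True if the insert passes the composite FK, False otherwise."""
    try:
        conn.execute(
            "INSERT INTO Users (UserId, LanguageCode) VALUES (?, ?)",
            (user_id, language_code),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(add_user(1, "EN"))  # True: ('Language', 'EN') exists
print(add_user(2, "CA"))  # False: 'CA' only exists under 'Country'
```

Every lookup column in the referencing table needs its own companion category column like this, which is exactly the overhead discussed below.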
The main problem with this type of approach to constants is that each constant is truly its own thing: Languages, Countries, States, Statuses, etc. are all completely separate entities. While the structure of them in the database is the same (as of today), the data within that structure does not represent the same things. You would be locked into a model that either prevents you from adding additional lookup fields later (such as ISO codes for Language and Country but not the others, or something related to States that is not applicable to the others), or would require adding NULLable fields with no way to know which Category/ies they applied to (have fun debugging issues related to that and/or explaining to the new person -- who has been there for 2 days and is tasked with writing a new report -- that the 3 digit ISO Country Code does not apply to the "Deleted" status).
This approach also requires that you maintain an arbitrary "Category" field in all related tables. And that is per lookup. So if you have CountryCode, LanguageCode, and StateCode in the fact table, each of those FKs gets a matching CategoryID field, so now that is 6 fields instead of 3. Even if you were able to use TINYINT for CategoryID, if your fact table has even 200 million rows, then those three extra 1 byte fields now take up 600 MB, which adversely affects performance. And let's not forget that backups will take longer and take up more space, but disk is cheap, right? Oh, and if backups take longer, then restores also take longer, right? Oh, but the table has closer to 1 billion rows? Even better ;-).
While this approach looks maybe "cleaner" or "easier" now, it is actually more costly in the long run, especially in terms of wasted developer time, as you (and/or others) in the future try to work around issues related to this poor design choice.
Has anyone even asked your project manager what the intended benefit of this is? It is a reasonable question if you are going to spend some amount of hours making changes to the system that there be a stated benefit for that time spent. It certainly does not make interacting with the data any easier, and in fact will make it harder, especially if you choose a string for the "Category" instead of a TINYINT or maybe SMALLINT.
If your PM still presses for this change, then it should be required, as part of that project, to also change any enums in the app code accordingly so that they match what is in the database. Since the database is having its values munged together, you can accomplish that in C# (assuming your app code is in C#, if not then translate to whatever is appropriate) by setting the enum values explicitly with a pattern of the first X digits are the "category" and the remaining Y digits are the "value". For example:
Assume the "Country" category == 1 and the "Language" category == 2, you could do:
enum AppCodes
{
    // Countries
    UnitedStates = 1000001,
    Canada = 1000002,
    SomewhereElse = 1000003,
    // Languages
    EnglishUS = 2000001,
    EnglishUK = 2000002,
    French = 2000003
};
Absurd? Completely. But also analogous to the request of merging all lookup tables into a single table. What's good for the goose is good for the gander, right?
Is this being suggested so you can minimise the number of admin screens you need for CRUD operations on your standing data? I've been here before and decided it was better/safer/easier to build a generic screen which used metadata to decide what table to extract from/write to. It was a bit more work to build but kept the database schema 'correct'.
All the standing data tables had the same basic structure, they were mainly for dropdown population with occasional additional fields for business rule purposes.
I have about 10 tables containing records with date ranges and some value belonging to the date range.
Each table has some meaning.
For example
rates
start_date DATE
end_date DATE
price DOUBLE
availability
start_date DATE
end_date DATE
availability INT
and then a table dates
day DATE
which holds a row for each day for 2 years ahead.
The final result comes from joining these 10 tables to the dates table.
The query takes a bit longer, because there are some other joins and subqueries.
I have been thinking about creating one bigger table containing all the 10 tables data for each day, but final table would have about 1.5M - 2M records.
From testing it seems to be quicker (0.2s instead of about 1s) to search in this table instead of joining tables and searching in the joined result.
Is there any real reason why it would be a bad idea to have a table with that many records?
The final table would look like
day DATE
price DOUBLE
availability INT
Thank you for your comments.
This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focused on one table at a time, I'd be tempted to say "leave well-enough alone". That is, don't make a change if it ain't broke. If your usage involved multiple updates to one type of record, I'd be inclined to leave them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are getting successive days at one time, you may find that having separate days in the underlying table greatly simplifies your queries. And, if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, giving room for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
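For what it's worth, the one-row-per-day table can be built straight from the range tables and the dates table. Below is a sketch using SQLite from Python; it assumes inclusive end dates, non-overlapping ranges within each table, ISO-formatted date strings, and only two of the ten tables. The table name daily is made up, and a real version would presumably also carry an item/product key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rates        (start_date TEXT, end_date TEXT, price REAL);
    CREATE TABLE availability (start_date TEXT, end_date TEXT, availability INTEGER);
    CREATE TABLE dates        (day TEXT PRIMARY KEY);

    INSERT INTO rates VALUES
        ('2024-01-01', '2024-01-03', 100.0),
        ('2024-01-04', '2024-01-05', 120.0);
    INSERT INTO availability VALUES
        ('2024-01-01', '2024-01-02', 5),
        ('2024-01-03', '2024-01-05', 2);
    INSERT INTO dates VALUES
        ('2024-01-01'), ('2024-01-02'), ('2024-01-03'), ('2024-01-04'), ('2024-01-05');

    -- Expand every range to one row per day, once, instead of on every query.
    CREATE TABLE daily AS
    SELECT d.day, r.price, a.availability
    FROM dates AS d
    LEFT JOIN rates        AS r ON d.day BETWEEN r.start_date AND r.end_date
    LEFT JOIN availability AS a ON d.day BETWEEN a.start_date AND a.end_date;

    CREATE INDEX idx_daily_day ON daily (day);
""")

rows = conn.execute("SELECT day, price, availability FROM daily ORDER BY day").fetchall()
for row in rows:
    print(row)
```

Searches then hit daily directly with a simple range predicate on day; the cost is that daily has to be rebuilt or patched whenever one of the source ranges changes.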
I went down this road once and regretted it.
The fact that you have a projection of millions of rows tells me that dates from one table don't line up with dates from another table, leading to creating extra boundaries for some attributes because being in one table all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had a lot more combinations to deal with and the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date - my "super" table was recalculated from the separate tables whenever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours except I had only 3 tables: I had availability, pricing and margin. The fact was that the 3 were unrelated, so date ranges never aligned, leading to lots of artificial rows in the big table.
Sorry for the long winded title, but the requirement/problem is rather specific.
With reference to the following sample (but very simplified) structure (in pseudo SQL), I hope to explain it a bit better.
TABLE StructureName {
Id GUID PK,
Name varchar(50) NOT NULL
}
TABLE Structure {
Id GUID PK,
ParentId GUID, -- FK to Structure
NameId GUID NOT NULL -- FK to StructureName
}
TABLE Something {
Id GUID PK,
RootStructureId GUID NOT NULL -- FK to Structure
}
As one can see, Structure is a simple tree structure (not worried about ordering of children for the problem). StructureName is a simplification of a translation system. Finally 'Something' is simply something referencing the tree's root structure.
This is just one of many tables that need to be versioned, but this one serves as a good example for most cases.
There is a requirement to version any changes to the name and/or the tree 'layout' of the Structure table. Previous versions should always be available.
There seem to be a few ways to tackle this issue, like copying the entire structure, but most approaches cause one to 'lose' referential integrity. For example, if one followed this approach, one would have to make a duplicate of the 'Something' record, given that the root structure will be a new record and have a new ID.
Other avenues of possible solutions are looking into how wikis handle this, or going a lot further and looking at how proper version control systems work.
Currently, I feel a bit clueless how to proceed on this in a generic way.
Any ideas will be greatly appreciated.
Thanks
leppie
Some quick ideas:
Full copy: Create a copy of the structure, but for every table add a version_id column to the PK and all FKs; thus you can create copies of the live data with complete referential integrity.
pro: easy to query the history
con: large amount of (redundant) data copied
Change copy: Only copy the stuff that actually changes, along with valid_from / valid_to data.
pro: low data volume copied
con: hard to query, because one has to join on intervals
Variation: This applies to both schemes. Instead of creating a copy of the structure, you might keep the current record in the same table as the old versions, but tag it as current.
pro: smaller number of tables, easier mixing of history and current information
con: normal operation operates on much bigger tables, which will cause a performance impact
Auditing log: Depending on your actual requirements it might be sufficient to just create an audit trail like this:
id, timestamp, changed_table, changed_column, old_value, new_value, changed_by
You might extend that to a full table structure:
transaction, table_change, changed_column
pro: generic, hence easy to implement for a large number of tables
con: if you need to reconstruct the state of a set of records at a given time, querying will become a nightmare
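As a rough sketch of the audit-log option, here is a trigger-based version using SQLite from Python. It uses the StructureName table from the question; the row_id column is an addition for the example (so the log says which row changed), and changed_by is left out because SQLite has no notion of a current user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StructureName (
        Id   TEXT PRIMARY KEY,
        Name TEXT NOT NULL
    );
    CREATE TABLE audit_log (
        id             INTEGER PRIMARY KEY AUTOINCREMENT,
        changed_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        changed_table  TEXT NOT NULL,
        changed_column TEXT NOT NULL,
        row_id         TEXT NOT NULL,   -- assumption: identifies the changed row
        old_value      TEXT,
        new_value      TEXT
    );
    -- Record every real change to Name; no-op updates are skipped.
    CREATE TRIGGER trg_structure_name_audit
    AFTER UPDATE OF Name ON StructureName
    WHEN OLD.Name <> NEW.Name
    BEGIN
        INSERT INTO audit_log (changed_table, changed_column, row_id, old_value, new_value)
        VALUES ('StructureName', 'Name', OLD.Id, OLD.Name, NEW.Name);
    END;
""")

conn.execute("INSERT INTO StructureName VALUES ('n1', 'Root')")
conn.execute("UPDATE StructureName SET Name = 'Main' WHERE Id = 'n1'")
conn.execute("UPDATE StructureName SET Name = 'Main' WHERE Id = 'n1'")  # unchanged, not logged
conn.execute("UPDATE StructureName SET Name = 'Top' WHERE Id = 'n1'")

history = conn.execute(
    "SELECT old_value, new_value FROM audit_log WHERE row_id = 'n1' ORDER BY id"
).fetchall()
print(history)  # [('Root', 'Main'), ('Main', 'Top')]
```

This is cheap to add per table, but rebuilding the whole tree as of some point in time means replaying the log, which is the "nightmare" mentioned in the con above.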
I wrote a blog about various approaches to versioning, but be warned: it's in German.
The data warehousing folks have several algorithms for "slowly-changing dimensions".
The more sophisticated algorithms provide date ranges around a dimension value to indicate when it's valid.
Depending on your versioning requirements you could do one of these things, cribbed from Kimball's The Data Warehouse Toolkit.
Assign a version number to rows of the structure table. This means you have to do some reasoning to collect a complete structure. It includes the selected version number unioned with rows that are unchanged in an earlier version.
Assign a date range or version range to rows of the structure table. This means that some rows have start dates and end dates; some rows will have end dates at some epoch in the impossible future. Or, if you use version numbers, you'll have a start-end pair or a start-infinity pair that indicates this row is still current. You can then trivially query the rows that are valid "today" or apply to the requested version.
Clone the structure for each version. This is unpleasant because the clone operation is costly. The queries, however, are trivial because the entire structure is available with a single, consistent version number.
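The version-range option could look roughly like this, sketched with SQLite from Python. The column names (start_version, end_version), the half-open range convention, and the 9999 "infinity" sentinel are all assumptions for the example.

```python
import sqlite3

CURRENT = 9999  # sentinel "infinity": the row is still current

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Structure (
        Id            TEXT    NOT NULL,
        ParentId      TEXT,
        Name          TEXT    NOT NULL,
        start_version INTEGER NOT NULL,  -- first version in which the row is valid
        end_version   INTEGER NOT NULL,  -- first version in which it is no longer valid
        PRIMARY KEY (Id, start_version)
    )
""")
conn.executemany(
    "INSERT INTO Structure VALUES (?, ?, ?, ?, ?)",
    [
        ("root", None,   "Root",              1, CURRENT),
        ("a",    "root", "Child A",           1, 3),        # renamed in version 3
        ("a",    "root", "Child A (renamed)", 3, CURRENT),
        ("b",    "root", "Child B",           2, CURRENT),  # added in version 2
    ],
)


def structure_at(version: int) -> list:
    """Return (Id, Name) for every row that is valid in the requested version."""
    return conn.execute(
        """
        SELECT Id, Name
        FROM Structure
        WHERE start_version <= ? AND ? < end_version
        ORDER BY Id
        """,
        (version, version),
    ).fetchall()


print(structure_at(1))  # [('a', 'Child A'), ('root', 'Root')]
print(structure_at(3))  # [('a', 'Child A (renamed)'), ('b', 'Child B'), ('root', 'Root')]
```

Because Id stays stable across versions, a table like Something can keep pointing at the same root Id and only needs to say which version it wants.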
We're doing a complex bit of data accumulation. Our customer sends us some stuff that includes two dimensions (time and a business unit). Time is mostly year-month. The business unit dimension has just a few attributes: a name, and a few categories to which BU's can belong for reporting and analysis purposes.
The stuff they send us includes some current state information (dates and codes). These seem fact-like. They also send some information that characterizes the relationship with the business unit (mostly additional codes). Again, these are unique to the business unit and time period.
Finally, they send us stuff that is clearly additive facts. It includes currency and counts that have proper units.
Should I commingle this qualitative information in a single fact table with the additive facts? Or should I separate the qualitative stuff (which can only be used with counts) from the quantitative stuff (which can be used with sum)?
Only put things in the fact table if they are degenerate (that is, so high-cardinality or unique that a separate dimension would end up in a 1-1 relationship with the fact table). Kimball recommends avoiding the temptation to put anything but degenerate dimensions in with the facts (a unique order number, for instance).
You can always put these in what Kimball calls a "junk" dimension. All those codes can simply be lumped into a junk dimension. Most dates would go in the fact table as keys into your date dimension in a particular role (usually with a natural int key of the form YYYYMMDD - one of the only times we don't use a meaningless identity surrogate key).
I like to naively view the star as all the facts and then which columns go into which dimensions is simply determined by convenience. One should not necessarily view them as corresponding to a particular business entity - remember, the star is not an ERD-style normalized OLTP database.
If the data is both directly related to the additive fact and is not something you want to be grouping, sorting or searching on, then putting it in the fact table is okay.
Be aware, though, that non-additive data in the fact table will either prevent roll-ups or will become a lossy operation.
Brad Wilson accurately describes the risk of adding them to your fact table. In the past, I've added junk attributes to my fact table only to require refactoring later.
The stuff they send us includes some
current state information (dates and
codes). These seem fact-like. They
also send some information that
characterizes the relationship with
the business unit (mostly additional
codes). Again, these are unique to the
business unit and time period.
What business purpose do the dates serve? Offhand, I'd recommend making these their own dimensions and describe them accurately.
How volatile are the extra codes that come in? If the grain of your fact table is date and BU, why can't they be included in the BU dimension and treated as slowly changing attributes?
Without more details I can't make a firm recommendation but these would be the first questions I'd ask myself.