The transactional fact table of one of the star schemas needs to answer questions like "Is the first application the final application?" This is associated with one of the business processes.
Is it a good idea to keep this as part of the fact table, in a column named
IsFirstAppLastFlag?
There are not enough flags to justify a separate dimension. Also, this calculated flag is essential for report writing. In this context, should we keep it in a dimension or in the fact table?
I assume a junk dimension is meant for flags and low-cardinality columns that are not very useful on their own, so that they can be kept together inside one dimension?
This will depend on your own needs but if you like the purest view of the fact table then the answer is no, these fields should not be included in your fact table.
The fact table should include dimension keys, degenerate dimension keys, and facts.
IsStatusOne, IsStatusTwo, etc. are attributes and, as you rightly suggest, would be well suited to a junk dimension in the absence of a more suitable dimension for them to belong to; e.g., IsWeekDay would be suited to the "Date" dimension table.
You may start off with only a few "Is" attributes in your fact table, but over time you may need more and more of them. You will then look back and possibly wish you had created a junk dimension.
Performance:
Interestingly, if you are using bit columns for your flags, there is little storage difference between having 8 bit flags in your fact table and having one tinyint dimension key. However, when your flags are more verbose or have multiple status values, you should use the junk dimension to improve performance on the fact table: less storage, less memory, more rows per page, etc.
Personally, I would junk them.
That seems fine, as long as it is an attribute of the fact, not of one of the dimensions. In some cases I think you might have a slowly changing dimension in which it would be more appropriately placed.
I would be concerned that this plan might require updates on the fact table, for example if you were intending to flag that a particular fact was the most recent for a customer. If that was the case it might be better to keep a transaction number in the fact table, and a "most recent transaction number" in the dimension table, and provide an indexing method to effectively retrieve the most recent per-customer.
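As a rough illustration of the "transaction number in the fact table, most recent transaction number in the dimension" idea, here is a minimal sketch using Python's sqlite3 module. The table and column names (dim_customer, fact_application, txn_no, most_recent_txn_no) are made up for the example, and SQLite stands in for whatever engine the warehouse actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key       INTEGER PRIMARY KEY,
    name               TEXT NOT NULL,
    most_recent_txn_no INTEGER            -- maintained by the load process
);
CREATE TABLE fact_application (
    customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
    txn_no       INTEGER NOT NULL,
    amount       REAL NOT NULL,
    PRIMARY KEY (customer_key, txn_no)    -- also serves as the lookup index
);
INSERT INTO dim_customer VALUES (1, 'Ann', 3), (2, 'Bob', 1);
INSERT INTO fact_application VALUES
    (1, 1, 10.0), (1, 2, 20.0), (1, 3, 30.0), (2, 1, 5.0);
""")

# The fact rows are never updated; only the dimension row changes
# when a newer transaction arrives for a customer.
latest = conn.execute("""
    SELECT d.name, f.txn_no, f.amount
    FROM dim_customer d
    JOIN fact_application f
      ON f.customer_key = d.customer_key
     AND f.txn_no = d.most_recent_txn_no
    ORDER BY d.name
""").fetchall()
print(latest)
```

The point of the design is that "most recent" is a property of the customer, so the update lands on one dimension row instead of on fact rows.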
You can use a junk dimension.
Instead of creating several dimensions with only a few rows each, you can create one dimension holding every possible combination of values, and then add just one foreign key to your fact table.
You can populate your junk dimension with a query like the one below.
WITH cteFlags AS
(
    SELECT 'N' AS Value
    UNION ALL
    SELECT 'Y'
)
SELECT
    Flag1.Value AS Flag1,
    Flag2.Value AS Flag2,
    Flag3.Value AS Flag3
FROM
    cteFlags Flag1
    CROSS JOIN cteFlags Flag2
    CROSS JOIN cteFlags Flag3
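To show how the junk dimension is then used, here is a small sketch in Python with sqlite3. The names dim_junk, junk_key and junk_key_for are assumptions for illustration only. It loads the eight flag combinations with the same cross join and looks up the single key that a fact row would store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_junk (
    junk_key INTEGER PRIMARY KEY,          -- surrogate key, auto-assigned
    flag1    TEXT NOT NULL,
    flag2    TEXT NOT NULL,
    flag3    TEXT NOT NULL,
    UNIQUE (flag1, flag2, flag3)
);
WITH cteFlags AS (SELECT 'N' AS Value UNION ALL SELECT 'Y')
INSERT INTO dim_junk (flag1, flag2, flag3)
SELECT f1.Value, f2.Value, f3.Value
FROM cteFlags f1
CROSS JOIN cteFlags f2
CROSS JOIN cteFlags f3;
""")

def junk_key_for(flag1, flag2, flag3):
    """Return the junk dimension key for one combination of flags."""
    row = conn.execute(
        "SELECT junk_key FROM dim_junk WHERE flag1 = ? AND flag2 = ? AND flag3 = ?",
        (flag1, flag2, flag3),
    ).fetchone()
    return row[0]

row_count = conn.execute("SELECT COUNT(*) FROM dim_junk").fetchone()[0]
print(row_count, junk_key_for("Y", "N", "Y"))
```

During the fact load, each incoming row's flags are translated to one junk_key, so adding a fourth flag later changes the dimension rather than the fact table's column list.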
In our database design we have a couple of tables that describe different objects but which are of the same basic type. As describing the actual tables and what each column is doing would take a long time I'm going to try to simplify it by using a similar structured example based on a job database.
So say we have the following tables:
These tables have no connections between each other but share identical columns. So the first step was to unify the identical columns and introduce a unique personId:
Now we have the "header" columns in person that are then linked to the more specific job tables using a 1 to 1 relation using the personId PK as the FK. In our use case a person can only ever have one job so the personId is also unique across the Taxi driver, Programmer and Construction worker tables.
While this structure works we now have the use case where in our application we get the personId and want to get the data of the respective job table. This gets us to the problem that we can't immediately know what kind of job the person with this personId is doing.
A few options we came up with to solve this issue:
Deal with it in the backend
This means just leaving the architecture as it is and looking for the right table in the backend code. This could mean looking through every table present and/or constructing a semi-complicated join select in which we have to sift through all columns to find the ones which are filled.
All in all: possible, but it means a lot of unnecessary selects. We would also like to keep such database-oriented logic in the actual database.
Using a Type Field
This means adding a field column in the Person table filled for example with numbers to determine the correct child table like:
So you could add a 0 in Type if it's a taxi driver, a 1 if it's a programmer and so on...
While this greatly reduces the amount of backend logic, we then have to make sure that the numbers we use in the Type field are known in the backend and never change.
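A minimal sketch of the type-field lookup, in Python with sqlite3. The table names, the columns license_no and language, and the TYPE_TABLES mapping are all assumptions made up for this example; the real schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    type      INTEGER NOT NULL              -- 0 = taxi driver, 1 = programmer
);
CREATE TABLE taxi_driver (
    person_id  INTEGER PRIMARY KEY REFERENCES person (person_id),
    license_no TEXT NOT NULL
);
CREATE TABLE programmer (
    person_id INTEGER PRIMARY KEY REFERENCES person (person_id),
    language  TEXT NOT NULL
);
INSERT INTO person VALUES (1, 'Ann', 0), (2, 'Bob', 1);
INSERT INTO taxi_driver VALUES (1, 'TX-123');
INSERT INTO programmer VALUES (2, 'Python');
""")

# The backend has to know this mapping, and the numbers must never change.
TYPE_TABLES = {0: "taxi_driver", 1: "programmer"}

def load_job(person_id):
    """First query reads the type, second query reads the matching child table."""
    row = conn.execute(
        "SELECT type FROM person WHERE person_id = ?", (person_id,)
    ).fetchone()
    if row is None:
        return None
    table = TYPE_TABLES[row[0]]  # fixed whitelist, never user input
    cur = conn.execute(f"SELECT * FROM {table} WHERE person_id = ?", (person_id,))
    columns = [col[0] for col in cur.description]
    return table, dict(zip(columns, cur.fetchone()))

print(load_job(2))
```

Note that this still takes two round trips per person: one to read the type and one to read the child table.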
Use separate IDs for each table
That means every job gets its own ID (has to be nullable) in Person like:
Now it's easy to find out which job each person has due to the others having an empty ID.
So my question is: Which one of these designs is the best practice? Am I missing an obvious solution here?
Bill Karwin made a good explanation on a problem similar to this one. https://stackoverflow.com/a/695860/7451039
We've now decided to go with the second option because it seems to come with the fewest drawbacks, as described by the other commenters and posters. As there was no actual answer presenting the second option as a solution, I will try to summarize our reasoning:
Against Option 1:
There is no way to distinguish the type by looking at the parent table. As a result, the backend would have to include all the logic, which includes scanning all tables for the one that contains the id. While you can compress most of the logic into a single big join select, it would still be a lot more logic than the other options need.
Against Option 3:
As #yuri-g said, this one is technically not possible, as the separate IDs could not be set up as primary keys. They would have to be nullable, which a primary key does not allow, essentially rendering the parent table useless, as one of the reasons for it was to have a unique personID across the tables.
Against a single table containing all columns:
For smaller use cases such as the one I described in the question this might be viable, but we are talking about a bunch of tables, each having roughly 2-6 columns. This option would turn into a column mess really quickly.
Against a flat design with a key-value table:
Our properties have completely different data types, different constraints and foreign key relations. All of this would be difficult or impossible in this design.
Against custom database objects containing the child-specific properties:
While this option, which #Matthew McPeak suggested, might be viable for a lot of people, our database design never really used objects, so introducing them to the mix would likely cause more confusion than it would help us.
In favor of the second option:
This option is easy to use in our table oriented database structure, makes it easy to distinguish the proper child table and does not need a lot of reworking to introduce. Especially since we already have something similar to a Type table that we can easily use for this purpose.
The third option, as you describe it, is impossible: no RDBMS (at least, none that I personally know of) would allow you to use NULLs in a PK (even a composite one).
Second is realistic.
And yes, first would take up to N queries to poll relatives in order to determine the actual type (where N is the number of types).
Although you won't get away with one query in the second case either: there will always be two of them, because you can't JOIN unless you know exactly what you should be joining.
So basically there are flaws in your design, and you should consider other options there.
Like denormalization: inline the non-shared attributes into the parent table anyway; the fields then become NULL for the types they do not apply to.
Or flexible, flat list of attribute-value pairs related through primary key (yes, schema enforcement is a trade-off).
Or switch to column-oriented DB: that's a case for it.
I have a question regarding data modelling. Suppose I have the following 3 student tables. Source_table1 contains A_ID as primary key and Name as an attribute. Source_table2 has B_ID as primary key and Name and Address as other attributes. Source_table3 has C_ID as primary key and Name, Address and Age as attributes. If we want to create a new Student Master table containing all the records, how can we do that? If we are creating a cross-reference table, how should we approach that problem?
Integrating data from different sources is complicated. In the end, you want to end up with something like:
student (student_id PK, name, address, source1_id, source2_id, source3_id)
However, there are some issues to resolve to get there.
Identity
How will you identify matching records in the different sources? It looks like your sources use surrogate identifiers, but those have no meaning outside the context of the source databases. What you're looking for is a suitable natural key. The only common denominator among the sources is a student's name, but names are notoriously poor identifiers.
It can be useful to actually test the data rather than assume it will or won't work. For example, a query such as:
SELECT s1.name, COUNT(*) AS amount
FROM student_source_1 s1
INNER JOIN student_source_2 s2 ON s1.name = s2.name
GROUP BY s1.name
HAVING COUNT(*) > 1
repeated for (student_source_2, student_source_3) and (student_source_1, student_source_3) should give you some insight into the size of the problem.
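For what it's worth, the three pairwise checks can be scripted rather than run by hand. This is only a sketch in Python with sqlite3: the sample rows are invented, the id column names follow the question (A_ID, B_ID, C_ID), and ambiguous_names is a hypothetical helper.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_source_1 (a_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student_source_2 (b_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE student_source_3 (c_id INTEGER PRIMARY KEY, name TEXT, address TEXT, age INTEGER);
INSERT INTO student_source_1 VALUES (1, 'Ann Lee'), (2, 'Bob Ray'), (3, 'Bob Ray');
INSERT INTO student_source_2 VALUES (1, 'Ann Lee', '1 Elm'), (2, 'Bob Ray', '2 Oak');
INSERT INTO student_source_3 VALUES (1, 'Ann Lee', '1 Elm', 20), (2, 'Cy Fox', '3 Ash', 21);
""")

SOURCES = ["student_source_1", "student_source_2", "student_source_3"]

def ambiguous_names(left, right):
    """Names that match more than one row pair between two sources."""
    # Table names come from the fixed SOURCES list, never from user input.
    sql = f"""
        SELECT l.name, COUNT(*) AS amount
        FROM {left} l
        INNER JOIN {right} r ON l.name = r.name
        GROUP BY l.name
        HAVING COUNT(*) > 1
        ORDER BY l.name
    """
    return conn.execute(sql).fetchall()

# One report entry per pair of sources: (1, 2), (1, 3), (2, 3).
report = {pair: ambiguous_names(*pair) for pair in combinations(SOURCES, 2)}
for pair, names in report.items():
    print(pair, names)
```

An empty list for a pair means names alone happen to be unambiguous between those two sources in the current data, which is not a guarantee for future loads.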
You could match student_source_2 and student_source_3 based on both name and address. That might give better results, or worse if the two sources have different addresses (or spellings thereof) for the same student. That brings us to our second concern:
Inconsistency
Assuming you can resolve the identity problem, you may need to deal with inconsistent data. What if sources 2 and 3 have different addresses for the same student? How do you determine the correct address?
In some cases, it could be sufficient to just map the sources without resolving inconsistencies.
Winging it in the real world
One technique I use on harder cases is to build a mapping table by hand, e.g.
student_map (student_id PK, source1_id, source2_id, source3_id)
Each of the source_id columns should have a unique constraint, and usually all 3 will be nullable. This is a first step toward the student table above.
I would start by inserting all the perfect 1-to-1 matches, then left join each of the sources with the mapping table to get the unmatched records. Having the unmatched source records side-by-side and sorted makes it easy to visually spot likely matches. It's tedious and error-prone work, but sometimes it must be done regardless. For inconsistencies I might choose the most complete/best looking source as base, and fill in the gaps from the other sources. If you can involve teachers or people who are familiar with the actual students, or present them with alternatives to choose from, by all means do so.
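The first two steps (insert the perfect matches, then list what is left over) might look roughly like the sketch below, in Python with sqlite3. It assumes a "perfect match" means a name that joins to exactly one row combination across all three sources, and it trims the source tables to id and name only; the sample data and the unmatched helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_source_1 (a_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student_source_2 (b_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student_source_3 (c_id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO student_source_1 VALUES (1, 'Ann Lee'), (2, 'Bob Ray'), (3, 'Bob Ray'), (4, 'Dee Kim');
INSERT INTO student_source_2 VALUES (10, 'Ann Lee'), (11, 'Bob Ray'), (12, 'Dee Kim');
INSERT INTO student_source_3 VALUES (20, 'Ann Lee'), (21, 'Bob Ray'), (22, 'D. Kim');

CREATE TABLE student_map (
    student_id INTEGER PRIMARY KEY,
    source1_id INTEGER UNIQUE,     -- nullable: not every student is in every source
    source2_id INTEGER UNIQUE,
    source3_id INTEGER UNIQUE
);

-- Step 1: names that produce exactly one row combination are perfect matches.
INSERT INTO student_map (source1_id, source2_id, source3_id)
SELECT MIN(s1.a_id), MIN(s2.b_id), MIN(s3.c_id)
FROM student_source_1 s1
JOIN student_source_2 s2 ON s2.name = s1.name
JOIN student_source_3 s3 ON s3.name = s1.name
GROUP BY s1.name
HAVING COUNT(*) = 1;
""")

def unmatched(table, id_col, map_col):
    """Step 2: source rows not yet in the mapping table, sorted for eyeballing."""
    # Identifiers are hard-coded by the caller, never user input.
    sql = f"""
        SELECT s.{id_col}, s.name
        FROM {table} s
        LEFT JOIN student_map m ON m.{map_col} = s.{id_col}
        WHERE m.student_id IS NULL
        ORDER BY s.name, s.{id_col}
    """
    return conn.execute(sql).fetchall()

mapped = conn.execute(
    "SELECT source1_id, source2_id, source3_id FROM student_map"
).fetchall()
left_over_1 = unmatched("student_source_1", "a_id", "source1_id")
left_over_3 = unmatched("student_source_3", "c_id", "source3_id")
print(mapped, left_over_1, left_over_3)
```

The leftovers ("Dee Kim" versus "D. Kim", the two "Bob Ray" rows) are exactly the cases that need a human, or more data, to resolve.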
More data can be extremely useful. If the sources have social security numbers, family information, etc, these can be used to match students. I would use any number of queries to find perfect matches among various pieces of information, and insert those into the mapping table, before doing the side-by-side matching.
You may well find that a source has internal consistency problems due to poor design - e.g. multiple records for the same student. This may require fixing the source data before continuing.
A good understanding of the relational model of data is invaluable for this kind of work, since you'll be identifying candidate keys, following dependencies and encountering anomalies.
I have a SQL Server 2008 database with a snowflake-style schema, so lots of different lookup tables, like Language, Countries, States, Status, etc. All these lookup tables have almost identical structures: two columns, Code and Decode. My project manager would like all of these different tables to be one BIG table, so I would need another column, say CodeCategory, and my primary key columns for this big table would be CodeCategory and Code. The problem is that for any of the tables that have the actual code (say Language Code), I cannot establish a foreign key relationship into this big decode table, as the CodeCategory would not be in the fact table, just the code. And codes by themselves will not be unique (they will be within a CodeCategory), so I cannot make an FK from just the fact table code field into the big lookup table's Code field.
So am I missing something, or is this impossible to do and still be able to do FKs in the related tables? I wish I could do this: have a FK where one of the columns I was matching to in the lookup table would match to a string constant. Like this (I know this is impossible but it gives you an idea what I want to do):
ALTER TABLE [dbo].[Users] WITH CHECK ADD CONSTRAINT [FK_User_AppCodes]
FOREIGN KEY('Language', [LanguageCode])
REFERENCES [dbo].[AppCodes] ([AppCodeCategory], [AppCode])
The above does not work, but if it did I would have the FK I need. Where I have the string 'Language', is there any way in T-SQL to substitute the table name from code instead?
I absolutely need the FKs so, if nothing like this is possible, then I will have to stick with my many little lookup tables. Any assistance would be appreciated.
Brian
It is not impossible to accomplish this, but it is impossible to accomplish this and not hurt the system on several levels.
While a single lookup table (as has been pointed out already) is a truly horrible idea, I will say that this pattern does not require a single field PK or that it be auto-generated. It requires a composite PK comprised of ([AppCodeCategory], [AppCode]) and then BOTH fields need to be present in the fact table that would have a composite FK of both fields back to the PK. Again, this is not an endorsement of this particular end-goal, just a technical note that it is possible to have composite PKs and FKs in other, more appropriate scenarios.
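Purely as a technical illustration of the composite PK/FK mechanics (again, not an endorsement), here is a sketch in Python with sqlite3. SQLite stands in for SQL Server here, and the LanguageCategory column pinned to a constant with DEFAULT and CHECK is an assumption of this example, one possible way of getting both fields into the referencing table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE AppCodes (
    AppCodeCategory TEXT NOT NULL,
    AppCode         TEXT NOT NULL,
    Decode          TEXT NOT NULL,
    PRIMARY KEY (AppCodeCategory, AppCode)
);
CREATE TABLE Users (
    UserId           INTEGER PRIMARY KEY,
    -- The extra category column every referencing table has to carry.
    LanguageCategory TEXT NOT NULL DEFAULT 'Language'
                     CHECK (LanguageCategory = 'Language'),
    LanguageCode     TEXT NOT NULL,
    FOREIGN KEY (LanguageCategory, LanguageCode)
        REFERENCES AppCodes (AppCodeCategory, AppCode)
);
INSERT INTO AppCodes VALUES
    ('Language', 'EN', 'English'),
    ('Country',  'US', 'United States');
""")

# 'EN' exists in the Language category, so this row is accepted.
conn.execute("INSERT INTO Users (UserId, LanguageCode) VALUES (1, 'EN')")

# 'US' exists only as a Country code, so the composite FK rejects it.
try:
    conn.execute("INSERT INTO Users (UserId, LanguageCode) VALUES (2, 'US')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

user_count = conn.execute("SELECT COUNT(*) FROM Users").fetchone()[0]
print(rejected, user_count)
```

It works, but notice that the referencing table now stores a column whose only job is to repeat a constant on every row.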
The main problem with this type of approach to constants is that each constant is truly its own thing: Languages, Countries, States, Statuses, etc. are all completely separate entities. While the structure of them in the database is the same (as of today), the data within that structure does not represent the same things. You would be locked into a model that either disallows adding additional lookup fields later (such as ISO codes for Language and Country but not the others, or something related to States that is not applicable to the others), or would require adding NULLable fields with no way to know which Category/ies they applied to (have fun debugging issues related to that and/or explaining to the new person -- who has been there for 2 days and is tasked with writing a new report -- that the 3 digit ISO Country Code does not apply to the "Deleted" status).
This approach also requires that you maintain an arbitrary "Category" field in all related tables. And that is per lookup. So if you have CountryCode, LanguageCode, and StateCode in the fact table, each of those FKs gets a matching CategoryID field, so now that is 6 fields instead of 3. Even if you were able to use TINYINT for CategoryID, if your fact table has even 200 million rows, then those three extra 1 byte fields now take up 600 MB, which adversely affects performance. And let's not forget that backups will take longer and take up more space, but disk is cheap, right? Oh, and if backups take longer, then restores also take longer, right? Oh, but the table has closer to 1 billion rows? Even better ;-).
While this approach looks maybe "cleaner" or "easier" now, it is actually more costly in the long run, especially in terms of wasted developer time, as you (and/or others) in the future try to work around issues related to this poor design choice.
Has anyone even asked your project manager what the intended benefit of this is? It is a reasonable question if you are going to spend some amount of hours making changes to the system that there be a stated benefit for that time spent. It certainly does not make interacting with the data any easier, and in fact will make it harder, especially if you choose a string for the "Category" instead of a TINYINT or maybe SMALLINT.
If your PM still presses for this change, then it should be required, as part of that project, to also change any enums in the app code accordingly so that they match what is in the database. Since the database is having its values munged together, you can accomplish that in C# (assuming your app code is in C#, if not then translate to whatever is appropriate) by setting the enum values explicitly with a pattern of the first X digits are the "category" and the remaining Y digits are the "value". For example:
Assume the "Country" category == 1 and the "Language" category == 2, you could do:
enum AppCodes
{
    // Countries (category 1)
    UnitedStates = 1000001,
    Canada = 1000002,
    SomewhereElse = 1000003,
    // Languages (category 2)
    EnglishUS = 2000001,
    EnglishUK = 2000002,
    French = 2000003
}
Absurd? Completely. But also analogous to the request of merging all lookup tables into a single table. What's good for the goose is good for the gander, right?
Is this being suggested so you can minimise the number of admin screens you need for CRUD operations on your standing data? I've been here before and decided it was better/safer/easier to build a generic screen which used metadata to decide what table to extract from/write to. It was a bit more work to build but kept the database schema 'correct'.
All the standing data tables had the same basic structure, they were mainly for dropdown population with occasional additional fields for business rule purposes.
I have the following tables:
Post
Id int
User
Id int
Then I have the table
Favorite
PostId int
UserId int
and the table
Vote
PostId int
UserId int
IsUpVote bit
IsDownVote bit
LastActivity datetime2
The problem is that if I merged both Favorite and Vote into a single table, then I'd have something like
UserPost
PostId int
UserId int
IsFavorited bit
IsUpVoted bit
IsDownVoted bit
LastActivity datetime2
IsDownVote couldn't be computed anymore (since now, I can't use a "doesn't exist: didn't vote; didn't vote up: voted down" pattern anymore) and LastActivity will only reflect the last time the vote has changed (either up, down, or removed). So I'd maybe have to change that field's name or its functionality, or even both.
So the question is basically, how wrong is having two tables relating Tables A and B (Post,User) in this case, which are indexed by the same primary key (PostId,UserId) in this case, but which are intended for different uses?
Favourites and Votes seem to be two different things, so IMHO you will be better off keeping them as separate tables. As you mentioned, you would lose functionality if you merged them, and I don't see any clear benefit to merge them. Stick with what you've got unless you can provide an awesome justification for the merge.
Nothing wrong at all.
I am not saying that the DDL provided shows correctly Normalised tables, but they are somewhat Normalised. As you have identified yourself, the two tables have different purposes, they have different meaning, so technically (theoretically, academically, and in practice [code] ), they are correct.
"related to the same parents" is not a criterion (there are many instances where there are many tables related to the same parents, and which are correct)
therefore such tables will "have the same PKs and FKs", so that is not a criterion either.
Only someone with no real concept of Normalisation, and no concept of the causes of negative performance, will suggest that "just because they have the same parents (and therefore the same pair of keys/indices)", they should be merged.
Vote and Favourite are two different Things, Entities, records of Action taken. Two tables is correct.
Distinction: The real reason IsDownVoted cannot be computed anymore is that it does not apply to Favourite. You have used an Indicator (bit) to identify that (although badly named), which is really a substitute for a Null column. Nulls are not good for performance, and it is a Good Thing that you have Indicators to identify the absence of data, and therefore avoided Nulls, but that is separate from breaking a Normalised design by merging them.
The merged table will perform slower on all accesses. When you SELECT Votes from it, you have to exclude Favourites, and vice versa, but it will be doing I/O for both, because they are located together (PostId, UserId). So the server is forever reading twice as many rows, using twice as much cache, etc. Then you will "add speed" by adding an index for (PostId, UserId, IsFavourited), making it even slower for Inserts and Deletes (while "speeding up" Selects). Messes get compounded, guaranteed; best to not have any mess in the first place.
When the database grows, you can independently add columns to either one of Vote and Favourite, without affecting the other. In a merged table, it will introduce complications.
You accept Answers too quickly.
While I won't say what you should do table-wise, if you use int instead of bit, with values like 0, 1 and -1, you can do calculations and comparisons and compute the values you want in a relatively simple way.
Speaking of relational databases, you should almost always aim for third normal form in your tables. Try looking at http://en.wikipedia.org/wiki/Database_normalization
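Something along these lines, sketched in Python with sqlite3. The VoteValue column name and the sample rows are made up; the CHECK constraint just keeps the column to -1, 0 or 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Vote (
    PostId    INTEGER NOT NULL,
    UserId    INTEGER NOT NULL,
    VoteValue INTEGER NOT NULL CHECK (VoteValue IN (-1, 0, 1)),
    PRIMARY KEY (PostId, UserId)
);
INSERT INTO Vote VALUES
    (1, 1,  1), (1, 2,  1), (1, 3, -1),
    (2, 1, -1), (2, 2,  0);
""")

# A plain SUM gives the score; conditional sums give the up/down counts.
scores = conn.execute("""
    SELECT PostId,
           SUM(VoteValue)                                 AS Score,
           SUM(CASE WHEN VoteValue =  1 THEN 1 ELSE 0 END) AS UpVotes,
           SUM(CASE WHEN VoteValue = -1 THEN 1 ELSE 0 END) AS DownVotes
    FROM Vote
    GROUP BY PostId
    ORDER BY PostId
""").fetchall()
print(scores)
```

With a single signed column there is also no way to store the contradictory "both up and down" state that two separate bit columns would allow.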
Cheers!
I have a dimension (SiteItem) that has two important facts:
perUserClicks
perBrowserClicks
However, within this dimension, I have groups of values based on an attribute column (let's call the groups AboveFoldItems, LeftNavItems, OnTheFlyItems, etc.), each of which has more facts that are specific to that group:
AboveFoldItems: eyeTime, loadTime
LeftNavItems: mouseOverTime
OnTheFlyItems: doesn't have any extra, but may in the future
Is the following fact table schema ok?
DateKey
SessionKey
SiteItemKey
perUserClicks
perBrowserClicks
eyeTime
loadTime
mouseOverTime
It seems a little wasteful since only some columns pertain to some dimension keys (the irrelevant facts are left NULL). But... this seems like it would be a common problem, so there should be a common solution for this, right?
I'm generally in agreement with Damir's answer on this, but because the fact table is very narrow in your particular case, there is still merit to Aaron's advocation for keeping the NULLs.
We have several star schemas in particular subject areas with multiple fact tables that share most (if not all) of the dimensions (conformed and internal). The limited-scope dimensions are not considered "conformed" across the enterprise, but they are what we would call "shared internal" dimensions.
Now typically, if the data is loaded contemporaneously so that the dimension hasn't changed, you can join both fact tables on the keys, but in general, of course, you cannot join two different star schemas on the dimension keys if they are surrogates in traditional slowly changing dimensions. In general, you have to join separate stars on the natural keys or "business keys" within the dimension and not on surrogates (except usually in the special case of the date dimension where it is unchanging and only has a natural key).
Note that when you do join the two stars, you have to use a LEFT JOIN, in which case you WILL produce NULLs which you will still probably have to take account of - so you're actually getting back to the original model you had with NULLs! ;-)
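A tiny demonstration of that last point, in Python with sqlite3. The fact table names factClicks and factAboveFold and the sample rows are hypothetical, and the sketch assumes the special case where both fact tables were loaded together so the surrogate keys line up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE factClicks (
    DateKey INTEGER, SessionKey INTEGER, SiteItemKey INTEGER,
    perUserClicks INTEGER, perBrowserClicks INTEGER,
    PRIMARY KEY (DateKey, SessionKey, SiteItemKey)
);
CREATE TABLE factAboveFold (
    DateKey INTEGER, SessionKey INTEGER, SiteItemKey INTEGER,
    eyeTime REAL, loadTime REAL,
    PRIMARY KEY (DateKey, SessionKey, SiteItemKey)
);
-- Item 1 is an above-the-fold item, item 2 is not.
INSERT INTO factClicks VALUES (20240101, 1, 1, 5, 3), (20240101, 1, 2, 7, 4);
INSERT INTO factAboveFold VALUES (20240101, 1, 1, 1.5, 0.2);
""")

rows = conn.execute("""
    SELECT c.SiteItemKey, c.perUserClicks, c.perBrowserClicks,
           a.eyeTime, a.loadTime
    FROM factClicks c
    LEFT JOIN factAboveFold a
      ON a.DateKey = c.DateKey
     AND a.SessionKey = c.SessionKey
     AND a.SiteItemKey = c.SiteItemKey
    ORDER BY c.SiteItemKey
""").fetchall()
print(rows)
```

Item 2 comes back with NULL (None) for eyeTime and loadTime, which is the same shape as the single wide fact table with nullable columns.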
The benefit of the extra fact table is more obvious when your tables are wide with a smaller set of keys and the vertical partitioning of the data produces space savings as well as a cleaner logical model - this is especially true when the keys are only really shared up to a point - having one dummy key or NULL key is definitely not a good idea - this usually points to a dimensional modeling problem.
However, as Aaron says, if you push it to extremes, you can have a single fact column in each fact table with shared keys, which means the key overhead dwarfs the fact cost and you really do end up in a disguised EAV model.
I would also look to see if you are in Kimball's situation of "too few dimensions". Seems like you must have good dimensional attributes lumped into the SessionKey and SiteItemKey - but without seeing your entire model and requirements, it's hard to say, but I would think you would have some user demographics in a low-cardinality or even snowflake dimension without the full Session or Site dimension.
There isn't an elegant solution really, you either have nullable columns or you use an EAV solution. I posted about EAV before (and generated a lot of comments that might be worthwhile reading):
What is so bad about EAV, anyway?
I am a fan of that model in some scenarios, but if your dimensions/attributes do not change frequently, it can be a lot of extra work for nothing. NULL values in a column do not really make waste as long as the surrounding code can deal with them appropriately.
You could have more than one fact table: factperUserClicks, factperBrowserClicks, factEyeTime, etc.
Each of these would have DateKey, SessionKey, SiteItemKey. This way only dimension keys that "make sense" appear with each fact.
Ideally, there should be no NULLS in the DW -- if you keep them in the same fact table, using zeros may be more appropriate.
As far as saving disk space, I do not see an ideal solution -- but, in a DW one is supposed to trade space for speed and (query) simplicity anyway.