Dimension model (recursive / hierarchical) for Data Warehouse - data-modeling

I'm having difficulty connecting a recursive (hierarchical) dimension table to a fact table, because there are two issues to deal with:
The dimension table has a parent-child (self-referencing) structure
The original source table keeps growing
id item_name parent_id
1 classification null
2 category null
3 group null
4 modern 1
5 modified 1
6 tools 2
7 meters 2
8 metal 3
9 plastic 3
10 lead 8
11 alloy 8
Denormalizing (flattening) this kind of table is not suitable: whenever a new entity type comes in, it would change the dimension structure.
What is the best approach for this type of dimension?
Kindly provide an example, including the query that would be used once the fact and dimension tables are connected.
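One possible approach, sketched here under assumptions rather than offered as the definitive design, is to keep the parent-child dimension as it is and expand it at query time with a recursive CTE, so that facts recorded against any node roll up to all of its ancestors. The fact_sales table, its columns, and its rows are hypothetical; only the dim_item rows come from the question. The sketch uses Python's sqlite3 module so it can be run as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (
    id        INTEGER PRIMARY KEY,
    item_name TEXT NOT NULL,
    parent_id INTEGER REFERENCES dim_item(id)
);
INSERT INTO dim_item VALUES
    (1, 'classification', NULL), (2, 'category', NULL), (3, 'group', NULL),
    (4, 'modern', 1), (5, 'modified', 1), (6, 'tools', 2), (7, 'meters', 2),
    (8, 'metal', 3), (9, 'plastic', 3), (10, 'lead', 8), (11, 'alloy', 8);

-- Hypothetical fact table: each fact points at one node of the hierarchy.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    item_id INTEGER NOT NULL REFERENCES dim_item(id),
    amount  REAL NOT NULL
);
INSERT INTO fact_sales VALUES (1, 10, 100.0), (2, 11, 50.0), (3, 9, 25.0), (4, 6, 10.0);
""")

# item_tree pairs every node with itself and with each of its descendants,
# so joining facts on descendant_id rolls amounts up to every ancestor.
rollup_sql = """
WITH RECURSIVE item_tree(ancestor_id, descendant_id) AS (
    SELECT id, id FROM dim_item
    UNION ALL
    SELECT t.ancestor_id, d.id
    FROM item_tree AS t
    JOIN dim_item AS d ON d.parent_id = t.descendant_id
)
SELECT a.item_name, SUM(f.amount) AS total_amount
FROM item_tree AS t
JOIN dim_item   AS a ON a.id = t.ancestor_id
JOIN fact_sales AS f ON f.item_id = t.descendant_id
GROUP BY a.id, a.item_name
ORDER BY a.id
"""
rollup = conn.execute(rollup_sql).fetchall()
for item_name, total_amount in rollup:
    print(item_name, total_amount)
```

Because the hierarchy is expanded by the query, a new level or entity type only adds rows to dim_item. If the recursion turns out to be too slow, the same ancestor/descendant pairs could be stored in a separate bridge table and refreshed when the dimension is loaded.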

Related

Simple database design - some columns have multiple values

Caveat: very new to database design/modeling, so bear with me :)
I'm trying to design a simple database that stores information about images in an archive. Along with file_name (which is one distinct string), I have fields like genre and starring where each field might contain multiple strings (if an image is associated with multiple genres, and/or if an image has multiple actors in it).
Right now the database is just a single table keyed on file_name, and the fields like starring and genre just have multiple comma-separated values stored. I can query it fine by using wildcards and like and in operators, but I'm wondering if there's a more elegant way to break out the data such that it is easier to use/query. For instance, I'd like to be able to find how many unique actors are represented in the archive, but I don't think that's possible with the current model.
I realize this is a pretty elementary question about data modeling, but any guidance anyone can provide or reading you can direct me to would be greatly appreciated!
Thanks!
You need to create extra tables to keep the schema normalized. In your situation you need 4 extra tables to represent these n->m relations (2 extra would be enough if the relations were 1->n).
Tables:
image(id, file_name)
genre(id, name)
image_genres(image_id, genre_id)
stars(id, name, ...)
image_stars(image_id, star_id)
And some data in tables:
image table
id file_name
1 /users/home/song/empire.png
2 /users/home/song/promiscuous.png
genre table
id name
1 pop
2 blues
3 rock
image_genres table
image_id genre_id
1 2
1 3
2 1
stars table
id name
1 Jay-Z
2 Alicia Keys
3 Nelly Furtado
4 Timbaland
image_stars table
image_id star_id
1 1
1 2
2 3
2 4
To count the unique actors in the database, you can simply run the SQL query below:
SELECT COUNT(name) FROM stars
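As a minimal sketch of how this fits together, the snippet below builds the image, stars and image_stars tables from the answer in an in-memory SQLite database and runs the count. The column types, the UNIQUE constraint on stars.name, and the composite primary key on the junction table are assumptions added for illustration; the answer only lists the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (id INTEGER PRIMARY KEY, file_name TEXT NOT NULL UNIQUE);
-- Assumption: each actor is stored once, so name is declared UNIQUE.
CREATE TABLE stars (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Junction table for the n->m relation between images and actors.
CREATE TABLE image_stars (
    image_id INTEGER NOT NULL REFERENCES image(id),
    star_id  INTEGER NOT NULL REFERENCES stars(id),
    PRIMARY KEY (image_id, star_id)
);
INSERT INTO image VALUES
    (1, '/users/home/song/empire.png'),
    (2, '/users/home/song/promiscuous.png');
INSERT INTO stars VALUES
    (1, 'Jay-Z'), (2, 'Alicia Keys'), (3, 'Nelly Furtado'), (4, 'Timbaland');
INSERT INTO image_stars VALUES (1, 1), (1, 2), (2, 3), (2, 4);
""")

# Every actor known to the database.
actor_count = conn.execute("SELECT COUNT(name) FROM stars").fetchone()[0]

# Only actors that appear in at least one image.
actors_in_images = conn.execute(
    "SELECT COUNT(DISTINCT star_id) FROM image_stars"
).fetchone()[0]

# All actors for one image, found by joining through the junction table.
empire_cast = [row[0] for row in conn.execute("""
    SELECT s.name
    FROM image AS i
    JOIN image_stars AS link ON link.image_id = i.id
    JOIN stars AS s ON s.id = link.star_id
    WHERE i.file_name = '/users/home/song/empire.png'
    ORDER BY s.id
""")]
print(actor_count, actors_in_images, empire_cast)
```

The two counts differ only if the stars table contains actors that are not linked to any image.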

Create a view with multiple foreign key referencing a single field

How can I create a view from a table that has multiple foreign keys referencing the same field of a single table? I have a Product table and a Reference table, and around 5 foreign keys in the Product table reference the RefCodeKey field in the Reference table. How can I create a view that shows the product reference codes by joining Product and Reference Code?
I have a product table as follows
PK PTK PC PN RCKey PSKey PCKey PCAKey
1 1 500000 Prod A 5 12 14 98
2 1 500001 Prod B 5 12 14 98
3 1 500002 Prod C 5 11 13 145
4 4 500002 Prod C 10 11 13 76
5 3 500002 Prod C 10 11 13 95
6 1 500005 Prod D 5 12 14 137
I have Reference Code Table as follows
RefCodeKey RefCodeType Code Label Status
1 ParentTypeKey assembly assembly Active
2 ParentTypeKey WHL WHL Active
3 ParentTypeKey TIRE TIRE Active
4 ParentTypeKey TIRE TIRE Active
5 RegionCodeKey 1 COMP 1 Active
6 RegionCodeKey 2 COMP 2 Active
7 RegionCodeKey 3 COMP 3 Active
8 RegionCodeKey 4 COMP 4 Active
9 RegionCodeKey 9 COMP 5 Active
10 RegionCodeKey 0 COMP 6 Active
11 ProductStatusKey CLOSED CLOSED Active
12 ProductStatusKey ACTIVE ACTIVE Active
13 ProductClassificationKey DropShip DropShip Active
14 ProductClassificationKey INFO NA INFO NA Active
How can I create a view that displays the result shown below?
PC PN RCKey PSKey PCKey
500000 Prod A COMP 1 ACTIVE INFO NA
500001 Prod B COMP 1 ACTIVE INFO NA
500002 Prod C COMP 1 CLOSED DropShip
500002 Prod C COMP 6 CLOSED DropShip
500002 Prod C COMP 6 CLOSED DropShip
500005 Prod D COMP 1 ACTIVE INFO NA
This is a common reporting pattern wherever the database architect has employed the "one true lookup table" model. I'm not going to get bogged down in the merits of that design. People like Celko and Phil Factor are far more erudite than me at commenting on these things. All I'll say is that having reported off over sixty enterprise databases in the last 15 years, that design is pervasive. Rightly or wrongly, you're probably going to see it over and over again.
There is currently insufficient information to answer your question definitively. The answer below makes assumptions about what I think the most likely missing information is.
I'll assume your product table is named PRODUCT
I'll assume your all-powerful lookup table is called REFS
I'll assume RefCodeKey in REFS has a unique constraint on it, or is the primary key
I'll assume the REFS table is relatively small (say < 100,000 rows). I'll come back to this point later.
I'll assume that the foreign keys in the PRODUCT table are nullable. This affects whether we INNER JOIN or LEFT JOIN.
SELECT prod.PC
,prod.PN
,reg_code.label as RCKey
,prod_stat.label as PSKey
,prod_clas.label as PCKey
FROM PRODUCT prod
LEFT JOIN REFS reg_code ON prod.RCKey = reg_code.RefCodeKey
LEFT JOIN REFS prod_stat ON prod.PSKey = prod_stat.RefCodeKey
LEFT JOIN REFS prod_clas ON prod.PCKey = prod_clas.RefCodeKey
;
The trick is that you can refer to the REFS table as many times as you like. You just need to give it a different alias and join it to the relevant FK each time. For example reg_code is an alias. Give your aliases meaningful names to keep your code readable.
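Since the question asks for a view, here is a minimal sketch of wrapping the query above in one, run against an in-memory SQLite database. The view name product_labels and the column types are hypothetical, the table names PRODUCT and REFS follow the assumptions stated earlier, and only a few of the sample rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE REFS (
    RefCodeKey  INTEGER PRIMARY KEY,
    RefCodeType TEXT, Code TEXT, Label TEXT, Status TEXT
);
CREATE TABLE PRODUCT (
    PK INTEGER PRIMARY KEY, PTK INTEGER, PC INTEGER, PN TEXT,
    RCKey INTEGER, PSKey INTEGER, PCKey INTEGER, PCAKey INTEGER
);
INSERT INTO REFS VALUES
    (5,  'RegionCodeKey', '1', 'COMP 1', 'Active'),
    (10, 'RegionCodeKey', '0', 'COMP 6', 'Active'),
    (11, 'ProductStatusKey', 'CLOSED', 'CLOSED', 'Active'),
    (12, 'ProductStatusKey', 'ACTIVE', 'ACTIVE', 'Active'),
    (13, 'ProductClassificationKey', 'DropShip', 'DropShip', 'Active'),
    (14, 'ProductClassificationKey', 'INFO NA', 'INFO NA', 'Active');
INSERT INTO PRODUCT VALUES
    (1, 1, 500000, 'Prod A', 5, 12, 14, 98),
    (3, 1, 500002, 'Prod C', 5, 11, 13, 145),
    (4, 4, 500002, 'Prod C', 10, 11, 13, 76);

-- Hypothetical view name; REFS is joined once per foreign key, each time under its own alias.
CREATE VIEW product_labels AS
SELECT prod.PC,
       prod.PN,
       reg_code.Label  AS RCKey,
       prod_stat.Label AS PSKey,
       prod_clas.Label AS PCKey
FROM PRODUCT prod
LEFT JOIN REFS reg_code  ON prod.RCKey = reg_code.RefCodeKey
LEFT JOIN REFS prod_stat ON prod.PSKey = prod_stat.RefCodeKey
LEFT JOIN REFS prod_clas ON prod.PCKey = prod_clas.RefCodeKey;
""")

rows = conn.execute("SELECT * FROM product_labels ORDER BY PC, RCKey").fetchall()
for row in rows:
    print(row)
```

The view keeps the column names requested in the question, although they now hold labels rather than keys.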
Note: Those RCKey/PSKey/PCKey names are really not good names. They'll come back to bite you. They don't represent the key. They represent a description of the thing in question. If it's a region code, call it region_code
The reason I'm assuming the REFS table is relatively small, is that if it's really large (I've seen one with 6 million lookup values across hundreds of codesets) and indexed to take RefCodeType into consideration, you might get better performance by adding a filter for RefCodeType to each of your LEFT JOINs. For example:
LEFT JOIN REFS prod_clas ON prod.PCKey = prod_clas.RefCodeKey
AND prod_clas.RefCodeType = 'ProductClassificationKey'

How to model arbitrarily ordering items in database?

I've taken on a new feature: letting users re-order some items with a drag-and-drop UI and saving each user's preferred order to the database. What's the best way to do so?
After reading some questions on StackOverflow, I found this solution.
Solution 1: Use decimal numbers to indicate order
For example,
id item order
1 a 1
2 b 2
3 c 3
4 d 4
If I move item 4 between items 1 and 2, the order becomes:
id item order
1 a 1
4 d 1.5
2 b 2
3 c 3
In this way, every new order = (order[i-1] + order[i+1]) / 2
If I need to save the preference for every user, then I need another relationship table like this:
user_id item_id order
1 1 1
1 2 2
1 3 3
1 4 1.5
I need num_of_users * num_of_items records to save this preference.
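Solution 1 can be sketched as follows, assuming a per-user table like the one above. The table name user_item_order, the column name sort_order (used instead of order, which is a reserved word in SQL), and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_item_order (
    user_id    INTEGER NOT NULL,
    item_id    INTEGER NOT NULL,
    sort_order REAL    NOT NULL,
    PRIMARY KEY (user_id, item_id)
);
INSERT INTO user_item_order VALUES (1, 1, 1.0), (1, 2, 2.0), (1, 3, 3.0), (1, 4, 4.0);
""")


def get_order(user_id, item_id):
    """Return the current sort_order of one item for one user."""
    row = conn.execute(
        "SELECT sort_order FROM user_item_order WHERE user_id = ? AND item_id = ?",
        (user_id, item_id),
    ).fetchone()
    return row[0]


def move_between(user_id, item_id, prev_item_id, next_item_id):
    """Place item_id between two neighbours by averaging their sort_order values."""
    new_order = (get_order(user_id, prev_item_id) + get_order(user_id, next_item_id)) / 2
    conn.execute(
        "UPDATE user_item_order SET sort_order = ? WHERE user_id = ? AND item_id = ?",
        (new_order, user_id, item_id),
    )
    return new_order


new_order = move_between(1, 4, 1, 2)  # item 4 goes between items 1 and 2
ordered_items = [
    row[0]
    for row in conn.execute(
        "SELECT item_id FROM user_item_order WHERE user_id = 1 ORDER BY sort_order"
    )
]
print(new_order, ordered_items)  # 1.5 [1, 4, 2, 3]
```

Only one row is updated per move. Repeatedly inserting into the same gap keeps halving the interval, so floating-point precision eventually runs out and the list would need to be renumbered.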
However, there's another solution I can think of.
Solution 2: Save the order preference in a column in the User table
This is straightforward: add a column to the User table to record the order. Each value would be parsed as an array of item_ids, ranked by their index in the array.
user_id item_order
1 [1,4,2,3]
2 [1,2,3,4]
Are there any limitations to this solution? Or are there any other ways to solve this problem?
Usually, an explicit ordering deals with the presentation or some specific processing of data. Hence, it's a good idea to separate entities from their presentation/processing. For example:
users
-----
user_id (PK)
user_login
...
user_lists
----------
list_id, user_id (PK)
item_index
item_index can be a simple integer value:
ordered continuously (1,2...N): DELETE/INSERT of the whole list are normally required to change the order
ordered discretely with some seed (10,20...N): you can insert new items without reordering the whole list
Another reason to separate entity data and lists: reordering a list should be done in a transaction, which may lead to row/table locks. With separate tables, only the data in the list table is impacted.
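The seeded variant (10, 20, ... N) can be sketched like this. The GAP constant and both helper functions are hypothetical names used for illustration, and the renumbering fallback is one assumed way to handle a gap that has been used up.

```python
GAP = 10  # assumed seed spacing, matching the 10, 20, ... example


def index_between(prev_index, next_index):
    """Return an integer strictly between the neighbours, or None if no room is left."""
    if next_index - prev_index < 2:
        return None
    return (prev_index + next_index) // 2


def renumber(item_ids):
    """Reassign indexes GAP, 2*GAP, ... following the given order."""
    return {item_id: (pos + 1) * GAP for pos, item_id in enumerate(item_ids)}


indexes = renumber(["a", "b", "c", "d"])  # a=10, b=20, c=30, d=40

# Move "d" between "a" and "b": only the row for "d" changes.
indexes["d"] = index_between(indexes["a"], indexes["b"])
ordered = sorted(indexes, key=indexes.get)
print(indexes["d"], ordered)  # 15 ['a', 'd', 'b', 'c']

# When two neighbours are adjacent integers there is no room left,
# so the whole list is renumbered before the next insert.
if index_between(10, 11) is None:
    indexes = renumber(ordered)
```

Most moves touch a single row, and the full rewrite of the list is needed only when a gap is exhausted.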

Database design for multiple similar types?

Say I have two question types: Multiple Choice and Range. A Range question allows users to answer by specifying a range of values in their answer (1-10 or 2-4 for example).
I inherited a database where the answers to these question types are stored in the same table which is structured like so:
Answers
-------
Id
QuestionId
choice
range_from
range_to
This results in data like below:
1 1 null 1 10
2 1 null 2 4
3 2 Pants null null
4 2 Hat null null
Does it make sense to include columns from every answer type in the answer table? Or should they be broken out into separate tables?
This is a very slimmed-down version of my real database. In reality there are about 8 question types, so with every answer there are several columns that are left unused.
Does it make sense to include columns from every answer type in the answer table?
This is the "all classes in the same table" strategy for implementing inheritance, which is suitable for a small number of classes. As the number of classes grows, you might consider one of the other strategies. There is no predefined "cut-off point" for that - you'll have to measure and decide for yourself.
The alternative would be an EAV-like system as proposed by blotto, but that would shift the enforcement of data consistency away from the DBMS. This is a valid solution if you don't know the structure of the data at design time and want to avoid DDL at run time, but if you do know the structure of the data at design time, you are better off sticking with inheritance.
You could have a single field that represents the 'type' of question, that seems best suited in the Question table ( not the Answer table). For example:
question_type ENUM('choice', 'range', 'type_3', 'type_4'..)
Then make a one-to-many link (a join table) that represents the Question-to-Answers relationship
AnswerId (pk) | QuestionId (fk)
1 1
2 1
3 2
4 2
Finally, your Answer table is a collection of values for each Answer. It can designate each record more specifically by having its own ENUM.
answer_type ENUM('low_range', 'high_range', 'choice', etc)
Id (pk)| AnswerId (fk) | Type | Value
1 1 low_range 1
2 1 high_range 10
3 2 low_range 2
4 2 high_range 4
5 3 choice Pants
6 4 choice Hat
This is much more scalable, and basically pivots the fields in your previous table to values in the answers table. So you can always add new 'Type's for both questions and answers without adding new fields to the schema.
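To show what reading from this layout might look like, the sketch below loads the sample rows into SQLite and pivots the range answers back into range_from/range_to columns. The table name answer_values is hypothetical, and because SQLite has no ENUM type, a CHECK constraint is used as a stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table name; CHECK stands in for ENUM, which SQLite lacks.
CREATE TABLE answer_values (
    id        INTEGER PRIMARY KEY,
    answer_id INTEGER NOT NULL,
    type      TEXT NOT NULL CHECK (type IN ('low_range', 'high_range', 'choice')),
    value     TEXT NOT NULL  -- one column holds every kind of value, stored as text
);
INSERT INTO answer_values VALUES
    (1, 1, 'low_range', '1'),  (2, 1, 'high_range', '10'),
    (3, 2, 'low_range', '2'),  (4, 2, 'high_range', '4'),
    (5, 3, 'choice', 'Pants'), (6, 4, 'choice', 'Hat');
""")

# Pivot the typed rows back into one row per range answer.
ranges = conn.execute("""
    SELECT answer_id,
           CAST(MAX(CASE WHEN type = 'low_range'  THEN value END) AS INTEGER) AS range_from,
           CAST(MAX(CASE WHEN type = 'high_range' THEN value END) AS INTEGER) AS range_to
    FROM answer_values
    WHERE type IN ('low_range', 'high_range')
    GROUP BY answer_id
    ORDER BY answer_id
""").fetchall()
print(ranges)  # [(1, 1, 10), (2, 2, 4)]
```

Every value shares a single text column, so the database cannot check on its own that a range bound is numeric.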

Should I add a common property of foreign keys to my table?

I have a database of test data that have been collected on behalf of agents. The test data are grouped together (after the fact) into result sets. As the tests come in, they are stored in the database with the ID of the corresponding agent:
TEST_ID TEST_OWNER TIMESTAMP RESULT_ID
1 1 0 null
2 1 15 null
3 2 30 null
4 2 32 null
5 1 34 null
The result sets are generated at a later time, in a way that groups tests that took place during a similar time frame. This judgment cannot be made as the tests come in.
RESULT_ID
1
2
3
All of the tests in a result set must belong to the same owner. I can ensure this (in code) as I assign the result IDs to the tests in my later operation, but some things would be easier if I had a TEST_OWNER field in my result set table.
Would adding this field be a violation of some normalization goal? The TEST_OWNER information will be duplicated, even though one instance of it is really implicit. I'm not a DBA, and I don't want to do things that are bad style.
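If the TEST_OWNER column were added to the result set table, one way the duplication could be put to use is a composite foreign key, so that the database itself rejects a test assigned to another owner's result set. This is only a sketch under assumptions: the table names tests and results, the column types, and the constraint are hypothetical and not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;  -- SQLite enforces foreign keys only when this is enabled

CREATE TABLE results (
    result_id  INTEGER PRIMARY KEY,
    test_owner INTEGER NOT NULL,
    UNIQUE (result_id, test_owner)  -- target for the composite foreign key
);
CREATE TABLE tests (
    test_id    INTEGER PRIMARY KEY,
    test_owner INTEGER NOT NULL,
    timestamp  INTEGER NOT NULL,
    result_id  INTEGER,  -- stays NULL until the test is grouped
    FOREIGN KEY (result_id, test_owner) REFERENCES results (result_id, test_owner)
);
INSERT INTO tests VALUES
    (1, 1, 0, NULL), (2, 1, 15, NULL), (3, 2, 30, NULL), (4, 2, 32, NULL), (5, 1, 34, NULL);
INSERT INTO results VALUES (1, 1);  -- result set 1 belongs to owner 1
""")

# Tests 1 and 2 belong to owner 1, so assigning them to result set 1 is accepted.
conn.execute("UPDATE tests SET result_id = 1 WHERE test_id IN (1, 2)")

# Test 3 belongs to owner 2, so the same assignment violates the foreign key.
try:
    conn.execute("UPDATE tests SET result_id = 1 WHERE test_id = 3")
    mismatch_rejected = False
except sqlite3.IntegrityError:
    mismatch_rejected = True

assigned = conn.execute("SELECT COUNT(*) FROM tests WHERE result_id = 1").fetchone()[0]
print(mismatch_rejected, assigned)  # True 2
```

With this arrangement the owner is stored twice, but the two copies cannot disagree.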
Jim, I am not completely sure whether you are saying this is a table in your DB?
TEST_ID TEST_OWNER TIMESTAMP RESULT_ID
1 1 0 null
2 1 15 null
3 2 30 null
4 2 32 null
5 1 34 null
If so the first thing I would do is pull the result attribute out of this table to achieve normalization. Or is this your Result table?
Regardless, are these results being derived from other data in the DB? If so, I don't see the need to duplicate things and also store the (calculated) results. Just derive them as needed and keep the DB clean.
If you need further info I need a better understanding of what you are presenting.

Resources