Lookup tables for basic user input? - database

Data like Birth month, day and year, user's age, Gender/Sex, etc. Should these be stored as text or ID based in database? ID based means they will have lookup values. Usage is for example: User signup will record age, user profiles will have a seeking partner age, etc so age and other data can be used in multiple places. In backend there will be analytic which is pushing me to use lookup tables for even small things like Gender which have only 2-3 values.

You will want to reference the datatypes or database has avilable. For mysql: http://dev.mysql.com/doc/refman/5.0/en/data-types.html
Do not use any of the fields you mentioned as the primary key. Either create an 'id' column or use the user's username.
Here are database for the fields:
birthDate = date
age = tinyint (Technically you don't need this since you can always determine it based on their birthDate and current date. It depends what you are doing)
gender = enum

I wouldn't bother with lookup tables for the fields you mentioned. Birth month, day, year could all be encapsulated in a single date field (w/ date type, not text), and then split out with database functions if you need. Age is just a number, which is all your id field will be, so not much point in making a lookup for that unless you want to actually limit the age range, in which case you could use a check constraint instead of a lookup if you needed to limit to an actual range (age >= 1 and age <= 20 for instance). Gender/sex is the only one I might consider a lookup on, but since there are so few possible values, a check constraint should suffice.
Oh, and I don't know what analytic you're using, but any analysis software worth its salt can produce field domains (lists of unique values in a table's fields) on its own, especially for the fields you mentioned. If you're building your own analytic (not sure, based on your comment to another poster's solution), you can easily make queries to do what you're talking about.

Related

ER diagram that implements a database for trainee

I edited and remade the ERD. I have a few more questions.
I included participation constraints(between trainee and tutor), cardinality constraints(M means many), weak entities (double line rectangles), weak relationships(double line diamonds), composed attributes, derived attributes (white space with lines circle), and primary keys.
Questions:
Apparently to reduce redundant attributes I should only keep primary keys and descriptive attributes and the other attributes I will remove for simplicity reasons. Which attributes would be redundant in this case? I am thinking start_date, end_date, phone number, and address but that depends on the entity set right? For example the attribute address would be removed from Trainee because we don't really need it?
For the part: "For each trainee we like to store (if any) also previous companies (employers) where they worked, periods of employment: start date and end date."
Isn't "periods of employment: start date, end date" a composed attribute? because the dates are shown with the symbol ":" Also I believe I didn't make an attribute for "where they worked" which is location?
Also how is it possible to show previous companies (employers) when we already have an attribute employers and different start date? Because if you look at the Question Information it states start_date for employer twice and the second time it says start_date and end_date.
I labeled many attributes as primary keys but how am I able to distinguish from derived attribute, primary key, and which attribute would be redundant?
Is there a multivalued attribute in this ERD? Would salary and job held be a multivalued attribute because a employer has many salaries and jobs.
I believe I did the participation constraints (there is one) and cardinality constraints correctly. But there are sentences where for example "An instructor teaches at least a course. Each course is taught by only one instructor"; how can I write the cardinality constraint for this when I don't have a relationship between course and instructor?
Do my relationship names make sense because all I see is "has" maybe I am not correctly naming the actions of the relationships? Also I believe schedules depend on the actual entity so they are weak entities.... so does that make course entity set also a weak entity (I did not label it as weak here)?
For the company address I put a composed attribute, street num, street address, city... would that be correct? Also would street num and street address be primary keys?
Also I added the final mark attribute to courses and course_schedule is this in the right entity set? The statement for this attribute is "Each trainee identified by: unique code, social security number, name, address, a unique telephone number, the courses attended and the final mark for each course."
For this part: "We store in the database all classrooms available on the site" do i make a composed attribute that contains site information?
Question Information:
A trainee may be self-employed or employee in a company
Each trainee identified by:
unique code, social security number, name, address, a unique
telephone number, the courses attended and the final mark for each course.
If the trainee is an employee in a company: store the current company (employer), start date.
For each trainee we like to store (if any) also previous companies (employers) where they worked, periods of employment: start date and end date.
If a trainee is self-employed: store the area of expertise, and title.
For a trainee that works for a company: we store the salary and job
For each company (employer): name (unique), the address, a unique telephone number.
We store in the database all known companies in the
city.
We need also to represent the courses that each trainee is attending.
Each course has a unique code and a title.
For each course we have to store: the classrooms, dates, and times (start time, and duration in minutes) the course is held.
A classroom is characterized by a building name and a room number and the maximum places’ number.
A course is given in at least a classroom, and may be scheduled in many classrooms.
We store in the database all classrooms
available on the site.
We store in the database all courses given at least once in the company.
For each instructor we will store: the social security number, name, and birth date.
An instructor teaches at least a course.
Each course is taught by only one instructor.
All the instructors’ telephone numbers must also be stored (each instructor has at least a telephone number).
A trainee can be a tutor for one or many trainees for a specific
period of time (start date and end date).
For a trainee it is not mandatory to be a tutor, but it is mandatory to have a tutor
The attribute ‘Code’ will be your PK because it’s only use seems to be that of a Unique Identifier.
The relationship ‘is’ will work but having a reference to two tables like that can get messy. Also you have the reference to "Employers" in the Trainee table which is not good practice. They should really be combined. See my helpful hints section to see how to clean that up.
Company looks like the complete table of Companies in the area as your details suggest. This would mean table is fairly static and used as a reference in your other tables. This means that the attribute ‘employer’ in Employed would simply be a Foreign Key reference to the PK of a specific company in Company. You should draw a relationship between those two.
It seems as though when an employee is ‘employed’ they are either an Employee of a company or self-employed.
The address field in Company will be a unique address your current city, yes, as the question states the table is a complete list of companies in the city. However because this is a unique attribute you must have specifics like street address because simply adding the city name will mean all companies will have the same address which is forbidden in an unique field.
Some other helpful hints:
Stay away from adding fields with plurals on them to your diagram. When you have a plural field it often means you need a separate table with a Foreign Key reference to that table. For example in your Table Trainee, you have ‘Employers’. That should be a Employer table with a foreign key reference to the Trainee Code attribute. In the Employer Table you can combine the Self-employed and Employed tables so that there is a single reference from Trainee to Employer.
ERD Link http://www.imagesup.net/?di=1014217878605. Here's a quick ERD I created for you. Note the use of linker tables to prevent Many to Many relationships in the table. It's important to note there are several ways to solve this schema problem but this is just as I saw your problem laid out. The design is intended to help with normalization of the db. That is prevent redundant data in the DB. Hope this helps. Let me know if you need more clarification on the design I provided. It should be fairly self explanatory when comparing your design parameters to it.
Follow Up Questions:
If you are looking to reduce attributes that might be arbitrary perhaps phone_number and address may be ones to eliminate, but start and end dates are good for sorting and archival reasons when determining whether an entry is current or a past record.
Yes, periods_of_employment does not need to be stored as you can derive that information with start and end dates. Where they worked I believe is just meant to say previous employers, so no location but instead it’s meant that you should be able to get a list all the employers the trainee has had. You can get that with the current schema if you query the employer table for all records where trainee code equals requested trainee and sort by start date. The reason it states start_date twice is to let you know that for all ‘previous’ employers the record will have a start and end date. Hence the previous. However, for current employers the employment hasn't ended which means there will be no end_date so it will null. That’s what the problem was stating in my opinion.
To keep it simple PK’s are unique values used to reference a record within another table. Redundant values are values that you essentially don’t need in a table because the same value can be derived by querying another table. In this case most of your attributes are fine except for Final_Mark in the Course table. This is redundant because Course_Schedule will store the Final_Mark that was received. The Course table is meant to simply hold a list of all potential courses to be referenced by Course_Schedule.
There is no multivalued attributes in this design because that is bad practice Job and salary are singular and if and job or salary changes you would add a new record to the employer table not add to that column. Multivalued attributes make querying a db difficult and I would advise against it. That’s why I mentioned earlier to abstract all attributes with plurals into their own tables and use a foreign key reference.
You essentially do have that written here because Course_Schedule is a linker table meaning that it is meant to simplify relationships between tables so you don’t have many to many relationships.
All your relationships look right to me. Also since the schedules are linker tables and cannot exist without the supporting tables you could consider them weak entities. Course in this schema is a defined list of all courses available so can be independent of any other table. This by definition is not a weak entity. When creating this db you’d probably fill in the course table and it probably wouldn’t change after that, except rarely when adding or removing an available course option.
Yes, you can make address a composite attribute, and that would be right in your diagram. To be clear with your use of Primary key, just because an attribute is unique doesn’t make it a primary key. A table can have one and only one primary key so you must pick a column that you are certain will not be repeated. In this example you may think street number might be unique but what if one company leaves an address and another company moves into that spot. That would break that tables primary key. Typically a company name is licensed in a city or state so cannot be repeated. That would be a better choice for your primary key. You can also make composite primary keys, but that is a more advanced topic that I would recommend reading about at a later date.
Take final_mark out of courses. That’s table will contain rows of only courses, those courses won’t be linked to any trainee except by course_schedule table. The Final_Mark should only be in that table. If you add final_mark to Course table then, if you have 10 trainees in a course, You’d have 10 duplicate rows in the course table with only differing final_marks. Instead only hold the course_code and title that way you can assign different instructors, trainees and classrooms using the linker tables.
No composite attribute is needed using this schema. You have a Classroom table that will hold all available classrooms and their relevant information. You then use the Classroom_Schedule linker table to assign a given Classroom to a Course_Schedule. No attributes of Classroom can be broken down to simpler attributes.

Keep referenced field data changes

I have a table Salary with a column PersonalId and a table Person with a column Name.
In the first table salary data will saved with a PersonalId which relates it to the Person table. In salary bill all data will gather together and Person name will be referenced from Person table.
After 1 year a specific person name will change from Michael to Maic. Now I want the last year salaries bill remain with previous person name Michael and the new salaries bill generate by new name Maic.
How we can do that?
It could depend on what type of operation you need to to most and on how much people change their name, because the number of joins you may need to make could vary a lot.
keep a field in Person that points to the next Person which is a change of name
keep another key in Person that varies only for the physical person
keep a limited number of names in Person that someone could dispose of, with an index of the current name
in another table you keep the relations between the various name of the Person
It could depend on what rules of normalization you follow, for now I'm not thinking about that.
Anyway, with the first case you don't need to change Salary, but to reconstruct the identity of a Person you need multiple requests or at least a stored procedure.
In the second case you still don't need to change Salary because you add a field to Person, but to get all the Salary entries for that physical person you'll need some work, again probably a stored procedure to get the added field and then something that joins all the Salary entries.
The third maybe is the simplest, but also the limited one, and you need in Salary another field that tells the index of the name to use in that entry.
The last case gives you a stable identity, but it may need some work because of the added table, and still there are multiple implementations. You could have salary reference that table instead of Person, or you could consult that table only when you need all the data, but you cannot reference its primary key from Salary because it would not permit to discriminate the name.
Lunadir's right in a certain way -- but all of those approaches are complex, and of rather great difficulty.
The other way -- simpler, and perhaps more correct & robust -- is to keep NAME and PAID_DATE columns in Salary or SalaryPaid, and write the actual name & date paid at the time the payment is made.
Good old batch-processing style -- and it has the benefit of actually capturing the key financial facts, of what payment was made & what name it was made to, which are the actual auditable transaction history.
Do you pay each Salary entry individually, or in bunch (PaySlip or SalaryPaid)? Put the NAME column wherever you record the actual payment & timestamp it occurred.

Database 3rd Normal Form

Can somebody make sure my database is in the 3rd normal form, and if not, explain why not?
I need the DB to only have 3 tables. So here it is:
Customer No. (PK) Store No. (PK) Sale No. (PK)
Name Location Customer No. (FK)
Telephone Revenue Store No. (FK)
Address Total
Purchases($) Paid
Store No.
Here are what your three tables should be: Table 1: Customer
Customer No. (PK)
Name
Telephone
Address
then Table 2: Store
Store no. (PK)
Location
then Table 3: Sale
Sale No. (PK)
Customer No (FK)
Store No (FK)
Total
Paid_yes_no
If you are trying to track partial payments, etc. through the Paid column, then it would be a separate table (and a much more complicated database). However, if your Paid column just indicates whether the bill has been paid or not, the above should work.
Perhaps you need a Date field there as well?
There are a few problems, some of this may be as the specification is for homework rather than the real world.
Store No. in Customers is a duplicated column It's reasonable to expect that a business with multiple stores have customers who use multiple stores - unless your specification says otherwise and in which case the taxonomy (naming) should be expanded, you could consider First Store No. or Home Store No. instead. Also it should be marked up as a foreign key if it is to remain.
Purchases($) in the customers table is reliant on other data which will change. Since it is derived from other information you should not store it.
Address is not a single column - it has multiple parts like Street, City, State, Country & ZIP Code may in turn require extra tables details to fully satisfy 2nd Normal Form. Similarly Telephone is unlikely to be just one number.
Each thing you need to know should appear once and once only. If you can calculate it from something else you should do that rather than store the answer. In the real world you might sometimes cache some information in a table or batch process it for performance, but those would be applied later and only if necessary.
A brief overview of database normalisation is at http://databases.about.com/od/specificproducts/a/normalization.htm which you should probably look through before reworking your project.

Ensuring referential integrity between Columns in a Table

Currently I have a table model with the following format:
criteria (criteria_id, criteria_name)
criteria_data(criteria_id, value)
I intend on storing Date-related information into the table in the sense that one criteria ( the Date criteria) could just contain the single date in the criteria_data table while other criteria_data could be the price of the stock for the date in a separate row. ( Another complication is that: the name of the stock is also a criteria)
My problem:
How is it possible for me to ensure that only 1 price ( single row criteria) can be entered into the table for a particular date and stock name ( 2 other separate criteria and rows).
I really don't want to enforce this in the App layer so I am mainly looking for DB Layer solutions , if available.
I am also open to being told to scrap my entire table model, if a more suitable Data model is suggested.
EDIT
After being informed of my folly ( see dPortas post below), I accept that this is not the smart way to go. I thought of a new model:
criteria_data(stockName,price, high,low,price,change)
While this is what it looks like, I am thinking the actual column names would be an identifier containing the criteria_id . For example, the stockname field could be col_1 and high could be col_3 but this would ensure that I could enforce integrity on the various columns.
What are people thoughts on this?
Your table design looks suspiciously like a case of EAV. Among the disadvantages of that anti-pattern are that you can't accurately store the right datatypes or apply constraints to it. I suggest you reconsider the design.
Suggested redesign: criteria (criteria_id, criteria_name, date, stock_name, price) key: (stock_name, date)

How do you handle "special-case" data when modeling a database?

Our organization provides a variety of services to our clients (e.g., web hosting, tech support, custom programming, etc...). There's a page on our website that lists all available services and their corresponding prices. This was static data, but my boss wants it all pulled from a database instead.
There are about 100 services listed. Only two of them, however, have a non numeric value for "price" (specifically, the strings "ISA" and "cost + 8%" - I really don't know what they're supposed to mean, so don't ask me).
I'd hate to make the "price" column a varchar just because of these two listings. My current approach is to create a special "price_display" field, which is either blank or contains the text to display in place of the price. This solution feels too much like a dirty hack though (it would needlessly complicate the queries), so is there a better solution?
Consider that this column is a price displayed to the customer that can contain anything.
You'd be inviting grief if you try to make it a numeric column. You're already struggling with two non-conforming values, and tomorrow your boss might want more...
PRICE ON APPLICATION!
CALL US FOR TODAYS SPECIAL!!
You get the idea.
If you really need a numeric column then call it internalPrice or something, and put your numeric constraints on that column instead.
When I have had to do this sort of thing in the past I used:
Price Unit Display
10.00 item null
100.00 box null
null null "Call for Pricing"
Price would be decimal datatype (any exact numeric, not float or real), unit and display would be some type of string data type.
Then used the case statement to display the price with either the price per unit or the display. Also put a constraint or trigger on the display column so that it must be null unless price is null. A constraint or trigger should also require a value in unit if price is not null.
This way you can calcuate prices for an order where possible and leave them out when the price is not specified but display both. I'd also put in a busness rule to make sure the total could not be totalled until the call for pricing was resolved (which you would also have to have a way to insert the special pricing to the order details rather than just pull from the price table).
Ask yourself...
Will I be adding these values? Will I be sorting by price? Will I need to convert to other currency values?
OR
Will I just be displaying this value on a web page?
If this is just a laundry list and not used for computation the simplest solution is to store price as a string (varchar).
Perhaps use a 'type' indicator in the main table, with one child table allowing numeric price and another with character values. These could be combined into one table, but I generally avoid that. You could also use an intermediate link table with a quantity if you ever want to base price on quantity purchased.
Lots of choices:
All prices stored as varchars
Prices stored numerically and extra price_display field that overrides the number if populated
Prices stored numberically and extra price_display field for display purposes populated manually or on trigger when numeric price is updated (duplication of data and it could get out of sync - yuk)
Store special case negative prices that map to special situations (simply yuk!!)
varchar price, prefix key field to a table of available prefixes ('cost +', ...), suffix key field to a table of available suffixes, type field key to a list of types for the value in price ('$', '%', 'description'). Useful if you'd need to write complex queries against prices in the future.
I'd probably go for 2 as a pragmatic solution, and an extension of 5 if I needed something very general for a generic pricing system.
If this is the extent of your data model, then a varchar field is fine. Your normal prices - decimal as they may be - are probably useless for calculations anyway. How do you compare $10/GB for "data transfer" and $25/month for "domain hosting"?
Your data model for this particular project isn't about pricing, but about displaying pricing. Design with that in mind.
Of course - if you're storing the price a particular customer paid for a particular project, or trying to figure out what to charge a particular customer - then you have a different (more complex) domain. And you'll need a different model to support that.
In that at least one of the alternate prices have a number involved, what about a Price column, a price type? The normal entries could be a number for the dollar value and type 'dollar', and the other could be 8 and 'PercentOverCost' and null and 'ISA' (for the Price and PriceType column).
You should probably have a PriceType table to validate and PriceTypeID if you go this route.
This would allow other types of pricing to be added in the future (unit pricing, foriegn currancy), give you a number, and also make it easier to know what type pricing you are dealing with..
http://books.google.co.in/books?id=0f9oLxovqIMC&pg=PA352&lpg=PA352&dq=How+do+you+handle+%E2%80%9Cspecial-case%E2%80%9D+data+when+modeling+a+database%3F&source=bl&ots=KxN9eRgO9q&sig=NqWPvxceNJPoyZzVS4AUtE-FF5c&hl=en&ei=V3RlSpDtI4bVkAWkzbHNDg&sa=X&oi=book_result&ct=result&resnum=3

Resources