Friendship Website Database Design - sql-server

I'm trying to create a database for a friendship website I'm building. I want to store multiple attributes about the user, such as gender, education, pets, etc.
Solution #1 - User table:
id | age | birth day | City | Gender | Education | fav Pet | fav hobby . . .
--------------------------------------------------------------------------
0 | 38 | 1985 | New York | Female | University | Dog | Ping Pong
The problem I'm having is that the list of attributes goes on and on, and right now my user table has 20-something columns.
I feel I could normalize this by creating another table for each attribute (see below). However, this would create many joins, and I'm still left with a lot of columns in the user table.
Solution #2 - User table:
id | age | birth day | City | Gender | Education | fav Pet | fav hobbies
--------------------------------------------------------------------------
0 | 38 | 1985 | New York | 0 | 0 | 0 | 0
Pets table:
id | Pet Type
---------------
0 | Dog
Does anyone have ideas on how to approach this problem? It feels like both answers are wrong. What is the proper table design for this database?

There is more to this than meets the eye. First of all, if you have tons of attributes, many of which will likely be null for any specific row, and a very dynamic selection of attributes (i.e. new attributes will appear quite frequently during the code's lifecycle), you might want to ask yourself whether an RDBMS is the best way to materialize this ... essentially non-schema. Maybe a document store would be a better fit?
If you do want to stay in the RDBMS world, the canonical answer is to have either one property table or one per datatype, plus a table of property types:
Users.id | .name | .birthdate | .Gender | .someotherfixedattribute
----------------------------------------------------------
1743 | Me. | 01/01/1970 | M | indeed
Propertytypes.id | .name
------------------------
234 | pet
235 | hobby
Properties.uid | .pid | .content
-----------------------------
1743 | 234 | Husky dog
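The layout above can be sketched as follows. This is only a minimal illustration, assuming SQLite (through Python's sqlite3 module) as a stand-in for whatever RDBMS is actually used; the snake_case table names, the column types, and the extra "Ping Pong" hobby row are assumptions added for the example.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in RDBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    birthdate TEXT,
    gender    TEXT
);
CREATE TABLE property_types (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE properties (
    uid     INTEGER NOT NULL REFERENCES users(id),
    pid     INTEGER NOT NULL REFERENCES property_types(id),
    content TEXT NOT NULL
);
INSERT INTO users VALUES (1743, 'Me.', '1970-01-01', 'M');
INSERT INTO property_types VALUES (234, 'pet'), (235, 'hobby');
-- The hobby row is a hypothetical addition for illustration.
INSERT INTO properties VALUES (1743, 234, 'Husky dog'), (1743, 235, 'Ping Pong');
""")

def user_properties(conn, user_id):
    """Return (property name, content) pairs for one user, ordered by name."""
    return conn.execute(
        """
        SELECT t.name, p.content
        FROM properties AS p
        JOIN property_types AS t ON t.id = p.pid
        WHERE p.uid = ?
        ORDER BY t.name
        """,
        (user_id,),
    ).fetchall()

print(user_properties(conn, 1743))
# [('hobby', 'Ping Pong'), ('pet', 'Husky dog')]
```

Adding a new kind of attribute then means inserting a row into property_types rather than altering the users table.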

You have a comment and an answer that recommend (or at least suggest) an Entity-Attribute-Value (EAV) model.
There is nothing wrong with using EAV if your attributes need to be dynamic, and your system needs to allow adding new attributes post-deployment.
That said, if your columns and relationships are all known up front, and they don't need to be dynamic, you are much better off creating an explicit model. It will (generally) perform better and will be much easier to maintain.

Instead of a wide table with a field per attribute, or many attribute tables, you could make a skinny table with many rows, something like:
Attributes (id,user_id,attribute_type,attribute_value)
Ultimately the best solution depends greatly on how the data will be used. People can only have one DOB, but maybe you want to allow for multiple addresses (billing/mailing/etc.), so addresses might deserve a separate table.

Related

Database Design - Need help scaling query

I'm trying to find the best data-structure/data store solution (highest performance) for the following request:
I have a list of attributes that I need to store for all individuals in the US, for example:
+------------+-------+-------------+
| Attribute | Value | SSN |
+------------+-------+-------------+
| hair color | black | 123-45-6789 |
| eye color | brown | 123-45-6789 |
| height | 175 | 123-45-6789 |
| sex | M | 123-45-6789 |
| shoe size | 42 | 123-45-6789 |
As you can guess, across the general population there is nothing unique and identifiable in any one of those attributes.
However, let's assume that if we were to fetch by a combination of 3 or 4 attributes, then I would be able to uniquely identify a person (find their SSN).
Now here's the difficulty: the set of combinations that can uniquely identify a person will evolve over time and be adjusted.
What would be my best bet for storing and querying the data with the scenario mentioned above, that will remain highly performant (<100ms) at scale?
Current attempt with combining two attributes:
SELECT * FROM (SELECT * FROM people WHERE hair='black') p1
JOIN (SELECT * FROM people WHERE height=175) p2
ON p1.SSN = p2.SSN
But with a database with millions of rows, as you can guess, this is NOT performant.
Thank you!
If the data store is not a constraint, I would use a document database, something like MongoDB, Cosmos DB or even Elasticsearch.
With Mongo, for example, you could leverage its schemaless nature and have a collection of People with one property per "attribute":
{
"SSN": "123-45-6789",
"eyeColor": "brown",
"hairColor": "blond",
"sex": "M"
}
Documents in this collection might have different properties, but it's not an issue. All you have to do now is to put an index on each one and run your queries.
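For comparison, if the data has to stay in the attribute/value/SSN shape shown in the question, matching a combination of attributes can be written as a single GROUP BY ... HAVING query instead of one self-join per attribute. The following is only a sketch under assumptions: SQLite through Python's sqlite3 module stands in for the real store, the person_attributes table, its index, and the second person are hypothetical, and nothing here demonstrates the <100ms target at scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person_attributes (
    ssn       TEXT NOT NULL,
    attribute TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (ssn, attribute)          -- one value per attribute per person
);
-- Index chosen so lookups by (attribute, value) do not scan the table.
CREATE INDEX idx_attr_value ON person_attributes (attribute, value, ssn);
""")
conn.executemany(
    "INSERT INTO person_attributes VALUES (?, ?, ?)",
    [
        ("123-45-6789", "hair color", "black"),
        ("123-45-6789", "eye color", "brown"),
        ("123-45-6789", "height", "175"),
        ("987-65-4321", "hair color", "black"),   # hypothetical second person
        ("987-65-4321", "height", "180"),
    ],
)

def find_ssns(conn, criteria):
    """Return SSNs whose rows match every (attribute, value) pair in criteria."""
    if not criteria:
        return []
    pair_clause = " OR ".join(["(attribute = ? AND value = ?)"] * len(criteria))
    params = [x for pair in criteria.items() for x in pair]
    sql = f"""
        SELECT ssn
        FROM person_attributes
        WHERE {pair_clause}
        GROUP BY ssn
        HAVING COUNT(*) = ?      -- all requested pairs matched
        ORDER BY ssn
    """
    return [row[0] for row in conn.execute(sql, params + [len(criteria)])]

print(find_ssns(conn, {"hair color": "black", "height": "175"}))
# ['123-45-6789']
```

Because the criteria are passed in as data, the set of identifying combinations can change without touching the schema or the query text.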

Database Design - Drop Down Input Box Issue

I'm trying to create a friendship site. The issue I'm having is when a user joins a website they have to fill out a form. This form has many fixed drop down items the user must fill out. Here is an example of one of the drop downs.
Drop Down (Favorite Pets)
Items in Favorite Pets
1. Dog
2. Cat
3. Bird
4. Hamster
What is the best way to store this info in a database? Right now the profile table has a column for each fixed drop down. Is this correct database design? See example:
User ID | Age | Country | Favorite Pet | Favorite Season
--------------------------------------------------------------
1 | 29 | United States | Bird | Summer
Is this the correct database design? Right now I have probably 30+ columns. Most of the columns are fixed because they are drop downs and the user has to pick one of the options.
What's the correct approach to this problem?
P.S. I also thought about creating a table for each drop down, but this would really complicate the queries and lead to lots of tables.
Another approach
Profile table
ID | username | age
-------------------
1 | jason | 27
profileDropDown table:
ID | userID | dropdownID
------------------------
1 | 1 | 2
2 | 1 | 7
Drop Down table:
ID | dropdown | option
---------------------
1 | pet | bird
2 | pet | cat
3 | pet | dog
4 | pet | hamster
5 | season | winter
6 | season | summer
7 | season | fall
8 | season | spring
"Best way to approach" or "correct way" will open up a lot of discussion here, which risks this question being closed. I would recommend creating a drop down table that has a column called "TYPE" or "NAME". You would then put a unique identifier of the drop down in that column to identify that set. Then have another column called "VALUE" that holds the drop down value.
For example:
ID | TYPE | VALUE
1 | PET | BIRD
2 | PET | DOG
3 | PET | FISH
4 | SEASON | FALL
5 | SEASON | WINTER
6 | SEASON | SPRING
7 | SEASON | SUMMER
Then, to get your PET drop down, you just select all rows from this table where type = 'PET'.
Will the set of questions (dropdowns) asked of every user ever change? Will you (or your successor) ever need to add or remove questions over time? If no, then a table for users with one column per question is fine, but if yes, it gets complex.
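A minimal sketch of that lookup table, assuming SQLite through Python's sqlite3 module; the table name dropdown_options and the UNIQUE constraint are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dropdown_options (
    id    INTEGER PRIMARY KEY,
    type  TEXT NOT NULL,
    value TEXT NOT NULL,
    UNIQUE (type, value)       -- no duplicate option within one drop down
);
INSERT INTO dropdown_options (id, type, value) VALUES
    (1, 'PET', 'BIRD'), (2, 'PET', 'DOG'), (3, 'PET', 'FISH'),
    (4, 'SEASON', 'FALL'), (5, 'SEASON', 'WINTER'),
    (6, 'SEASON', 'SPRING'), (7, 'SEASON', 'SUMMER');
""")

def options_for(conn, dropdown_type):
    """Return the option values for one drop down, in id order."""
    rows = conn.execute(
        "SELECT value FROM dropdown_options WHERE type = ? ORDER BY id",
        (dropdown_type,),
    )
    return [row[0] for row in rows]

print(options_for(conn, "PET"))
# ['BIRD', 'DOG', 'FISH']
```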
Database purists would require two tables for each question:
One table containing a list of all valid answers for that question
One table containing the many to many relation between user and answer to “this” question
If a new question is added, create new tables; if a question is removed, drop those tables (and, of course, adjust all your code. Ugh.) This would work, but it's hardly efficient.
If, as seems likely, all the questions and answer sets are similar, then a three-table model suggests itself:
A table with one row per question (QuestionId, QuestionText)
A table with one row for each answer for each Question (QuestionId, AnswerId, AnswerText)
A table with one row for each user-answered question (UserId, QuestionId, AnswerId)
Adding and removing questions is straightforward, as is identifying skipped or unanswered questions (such as, if you add a new question a month after going live).
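The three-table model can be sketched like this, assuming SQLite through Python's sqlite3 module. The table names, the sample rows, and the one-answer-per-question primary key on user_answers are assumptions made for the example, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE questions (
    question_id   INTEGER PRIMARY KEY,
    question_text TEXT NOT NULL
);
CREATE TABLE answers (
    question_id INTEGER NOT NULL REFERENCES questions(question_id),
    answer_id   INTEGER NOT NULL,
    answer_text TEXT NOT NULL,
    PRIMARY KEY (question_id, answer_id)
);
CREATE TABLE user_answers (
    user_id     INTEGER NOT NULL,
    question_id INTEGER NOT NULL,
    answer_id   INTEGER NOT NULL,
    PRIMARY KEY (user_id, question_id),    -- assumes one answer per question
    FOREIGN KEY (question_id, answer_id)
        REFERENCES answers(question_id, answer_id)
);
INSERT INTO questions VALUES (1, 'Favorite pet'), (2, 'Favorite season');
INSERT INTO answers VALUES
    (1, 1, 'Dog'), (1, 2, 'Cat'), (2, 1, 'Summer'), (2, 2, 'Winter');
INSERT INTO user_answers VALUES (1, 1, 2);  -- user 1 picked 'Cat'
""")

def unanswered_questions(conn, user_id):
    """Return the text of every question the user has not answered yet."""
    rows = conn.execute(
        """
        SELECT q.question_text
        FROM questions AS q
        LEFT JOIN user_answers AS ua
               ON ua.question_id = q.question_id AND ua.user_id = ?
        WHERE ua.user_id IS NULL
        ORDER BY q.question_id
        """,
        (user_id,),
    )
    return [row[0] for row in rows]

print(unanswered_questions(conn, 1))
# ['Favorite season']
```

A question added after launch shows up as unanswered for every existing user without any schema change.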
As with most everything, there’s a whole lot of “it depends” behind this, most of which depends on what you want your system to do.

Which of these definitions explain 1NF?

I found the definition of 1NF vague when I tried to look it up on Google.
Some sites, like this one, say a table is in first normal form when it doesn't have any repeating sets of columns.
Others (most of them) say there shouldn't be multiple values of the same domain in the same column.
And some say all tables should have a primary key, while others don't mention a primary key at all!
Can someone explain this for me?
A relation is in first normal form if it has the property that none of its domains has elements which are themselves sets.
From E. F. Codd (Oct 1972). "Further normalization of the database relational model"
This really gets down to what it is about, from the guy who invented the relational database model.
When something is in the first normal form, there are no columns which themselves contain sets of data.
The Wikipedia article on first normal form demonstrates this with a denormalized table:
Example1:
Customer
Customer ID | First Name | Surname | Telephone Number
123 | Robert | Ingram | 555-861-2025
456 | Jane | Wright | 555-403-1659, 555-776-4100
789 | Maria | Fernandez | 555-808-9633
This table is not normalized because Jane's telephone number is a set of values. Writing the table as follows is still in violation of 1NF:
Example2:
Customer
Customer ID | First Name | Surname | Telephone Number
123 | Robert | Ingram | 555-861-2025
456 | Jane | Wright | 555-403-1659
456 | Jane | Wright | 555-776-4100
789 | Maria | Fernandez | 555-808-9633
The proper way to normalize the table is to break it out into two tables.
Example3:
Customer
Customer ID | First Name | Surname
123 | Robert | Ingram
456 | Jane | Wright
789 | Maria | Fernandez
Phone
Customer ID | Telephone Number
123 | 555-861-2025
456 | 555-403-1659
456 | 555-776-4100
789 | 555-808-9633
Another way of looking at 1NF is as defined by Chris Date (from Wikipedia):
There's no top-to-bottom ordering to the rows.
There's no left-to-right ordering to the columns.
There are no duplicate rows.
Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).
All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps].
Example2 lacks a unique key which is in violation of rule 3. Example1 violates rule 4 in that the telephone number contains multiple values.
Only Example3 fills all those requirements.
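Example3 can be written out as a small schema. This is a sketch only, assuming SQLite through Python's sqlite3 module; the snake_case names and the composite primary key on phone are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT NOT NULL,
    surname     TEXT NOT NULL
);
CREATE TABLE phone (
    customer_id      INTEGER NOT NULL REFERENCES customer(customer_id),
    telephone_number TEXT NOT NULL,
    PRIMARY KEY (customer_id, telephone_number)  -- one row per number, no duplicates
);
INSERT INTO customer VALUES
    (123, 'Robert', 'Ingram'), (456, 'Jane', 'Wright'), (789, 'Maria', 'Fernandez');
INSERT INTO phone VALUES
    (123, '555-861-2025'), (456, '555-403-1659'),
    (456, '555-776-4100'), (789, '555-808-9633');
""")

def phones_for(conn, customer_id):
    """Return all telephone numbers for one customer, sorted."""
    rows = conn.execute(
        "SELECT telephone_number FROM phone "
        "WHERE customer_id = ? ORDER BY telephone_number",
        (customer_id,),
    )
    return [row[0] for row in rows]

print(phones_for(conn, 456))
# ['555-403-1659', '555-776-4100']
```

Each cell holds a single value, and Jane's second number is an extra row in phone rather than a second value in one column or a repeated customer row.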
Further reading:
Simple Guide to Five Normal Forms in Relational Database Theory
The simplest explanation I have found is this modified definition copied from here:
1st Normal Form Definition
A database is in first normal form if it satisfies the following conditions:
1) Contains only atomic values
2) There are no repeating groups

Which is a better database schema for a tracking tool?

I have to generate a view that shows tracking across each month. The ultimate view will be something like this:
| Person | Task | Jan | Feb | Mar| Apr | May | June . . .
| Joe | Roof Work | 100% | 50% | 50% | 25% |
| Joe | Basement Work | 0% | 50% | 50% | 75% |
| Tom | Basement Work | 100% | 100% | 100% | 100% |
I already have the following tables:
Person
Task
I am now creating a new table with foreign keys into the above 2 tables, and I am trying to figure out the pros and cons of creating 1 or 2 tables.
Option 1:
Create a new table with the following Columns:
Id
PersonId
TaskId
Jan2012
Feb2012
Mar2012
Apr2012
or
Option 2:
Have 2 separate tables
One table for just
Id
PersonId
TaskId
and another table for just the following columns
Id
PersonTaskId (the id from table above)
MonthYearKey
MonthYearValue
So an example record would be
| 1 | 13 | Jan2011 | 100% |
where 13 would represent a specific unique Person and Task combination. This second way would avoid having to create new columns to continue over time (which seems right), but I also want to avoid overkill.
Which would be the more scalable way to design this schema? Any other suggestions or more elegant ways of doing this would be great as well.
You can have an m2m (many-to-many) table with data columns. I don't see a reason why you can't just put MonthYearKey and MonthYearValue on the same table with PersonId and TaskId:
Id
TaskId
PersonId
MonthYearKey
MonthYearValue
It's possible too that you would want to move the MonthYearKey out into their own table, it really just comes down to common queries and what this data is used for.
I would note, you never want to design a schema where you are adding columns due to time. The first option would require maintenance all the time, and would become very difficult to query also.
Option 2 is definitely more scalable and is not overkill.
Option 1 would require you to add a new column every month and simple date based queries of your data would not be possible, e.g. Show me all people who worked at least 90% in any month last year.
The ultimate view would be generated from a particular query or view of your data.
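A sketch of how the row-per-month design supports both the date-based query and the final view, assuming SQLite through Python's sqlite3 module. The allocation table name, storing the month as a 'YYYY-MM' string, the year 2012, and the integer percent column are all assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE task   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE allocation (
    id        INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(id),
    task_id   INTEGER NOT NULL REFERENCES task(id),
    month     TEXT    NOT NULL,            -- 'YYYY-MM', sorts chronologically
    percent   INTEGER NOT NULL,
    UNIQUE (person_id, task_id, month)
);
INSERT INTO person VALUES (1, 'Joe'), (2, 'Tom');
INSERT INTO task   VALUES (1, 'Roof Work'), (2, 'Basement Work');
INSERT INTO allocation (person_id, task_id, month, percent) VALUES
    (1, 1, '2012-01', 100), (1, 1, '2012-02', 50),
    (1, 2, '2012-01', 0),   (1, 2, '2012-02', 50),
    (2, 2, '2012-01', 100), (2, 2, '2012-02', 100);
""")

def people_with_month_at_least(conn, year, threshold):
    """People who reached `threshold` percent on some task in any month of `year`."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.name
        FROM allocation AS a
        JOIN person AS p ON p.id = a.person_id
        WHERE a.month BETWEEN ? AND ? AND a.percent >= ?
        ORDER BY p.name
        """,
        (f"{year}-01", f"{year}-12", threshold),
    )
    return [row[0] for row in rows]

def pivot_by_month(conn, months):
    """One row per (person, task) with one percent column per requested month."""
    cols = ", ".join("MAX(CASE WHEN a.month = ? THEN a.percent END)" for _ in months)
    sql = f"""
        SELECT p.name, t.name, {cols}
        FROM allocation AS a
        JOIN person AS p ON p.id = a.person_id
        JOIN task   AS t ON t.id = a.task_id
        GROUP BY p.name, t.name
        ORDER BY p.name, t.name
    """
    return conn.execute(sql, tuple(months)).fetchall()

print(people_with_month_at_least(conn, 2012, 90))
# ['Joe', 'Tom']
print(pivot_by_month(conn, ["2012-01", "2012-02"]))
# [('Joe', 'Basement Work', 0, 50), ('Joe', 'Roof Work', 100, 50), ('Tom', 'Basement Work', 100, 100)]
```

A new month only adds rows; the month columns exist solely in the pivot query that builds the view.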

How to design database, one table for all product type or each table for each product type?

I'm starting a new e-commerce web application (a pet project) that sells both T-shirts and shoes. My store has only free-size T-shirts, so a T-shirt has only a color column, while a shoe has columns for size and color.
Now it's time to create tables to store that data. I want to know whether it is good to create separate tables for shoes and T-shirts, or whether it's better to keep all of them in one table.
If there is a better way to store such data, please let me know.
You definitely don't want to create a Shoe table and a TShirt table. Your shop might grow, and one day you'll have a thousand such product tables. Writing SQL for that would be a nightmare. Plus, you might have different kinds of t-shirts eventually, some with color, some with size and color, and so on. If you create a new table for each, you'll lose track of them quickly, and if you don't, why have separate tables for t-shirts and shoes, but not for one-size t-shirts and multi-size t-shirts?
While designing your database, you should be asking yourself: what are the entities in my realm? what are the things that never change and are uniquely identifiable? In a shop, a particular item that can be sold at a particular price is one such entity. So you might have a products table that has a key for each particular item you sell, and maybe a name, a type, a size and a color column:
item
id | type | name | size | color
------------------------------------
1 | shoe | Marathon | 9 | white
2 | shoe | Marathon | 9 | black
Looking at this table, you notice that we have two entries for the highly successful Marathon running shoe, and that seems to be a normalization violation. Indeed, you probably have two entities: a shippable item and a catalog product. The shoe "Marathon" is probably something that has one picture and one description in your store, followed by an "available in the following colors and sizes:" line. So now you have two tables:
product
id | type | name | supplier | picture | description
--------------------------------------------------------------------------------------
1 | shoe | Marathon | TrackNField Co. | marathon.jpg | Run faster than light!
2 | tshirt | FlowerPower | SF Shirts | fpower.jpg | If you're going to San Francisco...
item
id | product_id | size | color | price
--------------------------------------
1 | 1 | 9 | white | 99.99
2 | 1 | 9 | black | 99.99
3 | 2 | | blue | 19.99
The "type" column in the product table can be a tricky one. You'll probably want to display products by category, let the user click on "shoes" and get all products with type "shoe". Easy so far, but eventually someone will mistype an entry "sheo", and then you can't find that product under shoes anymore. So it's better to separate the categorization from the products, for example by having a product_type table:
product_type
id | name
---------
1 | shoe
product
id | type_id | name | supplier | picture | description
--------------------------------------------------------------------------------------
1 | 1 | Marathon | TrackNField Co. | marathon.jpg | Run faster than light!
with a reference to the type in the product table. That's ok as long as your type hierarchy stays shallow, but what if you want to have subcategories, like "sneaker", "basketball shoe", "suede shoe", and so on? One shoe might even belong to several of these subcategories. In that case you can try this:
category
id | name | supercategory_id
------------------------------------
1 | shoe |
2 | running shoe | 1
product_category
product_id | category_id
------------------------
1 | 2
product
id | name | supplier | picture | description
--------------------------------------------------------------------------------------
1 | Marathon | TrackNField Co. | marathon.jpg | Run faster than light!
And if you want to display multiple hierarchies of categorizations (as most big ecommerce sites do these days), you'll have to come up with something even more sophisticated.
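For the single-hierarchy category/product_category/product layout shown above, listing everything under "shoe" (including subcategories) can be done with a recursive query. This is a sketch under assumptions: SQLite through Python's sqlite3 module, trimmed-down product columns, and a hypothetical "tshirt" category row added so there is something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    id               INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    supercategory_id INTEGER REFERENCES category(id)   -- NULL for top level
);
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product_category (
    product_id  INTEGER NOT NULL REFERENCES product(id),
    category_id INTEGER NOT NULL REFERENCES category(id),
    PRIMARY KEY (product_id, category_id)
);
INSERT INTO category VALUES
    (1, 'shoe', NULL), (2, 'running shoe', 1), (3, 'tshirt', NULL);
INSERT INTO product VALUES (1, 'Marathon'), (2, 'FlowerPower');
INSERT INTO product_category VALUES (1, 2), (2, 3);
""")

def products_in(conn, category_name):
    """Products filed under the named category or any of its descendants."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM category WHERE name = ?
            UNION ALL
            SELECT c.id
            FROM category AS c
            JOIN subtree AS s ON c.supercategory_id = s.id
        )
        SELECT DISTINCT p.name
        FROM product AS p
        JOIN product_category AS pc ON pc.product_id = p.id
        JOIN subtree AS s ON s.id = pc.category_id
        ORDER BY p.name
        """,
        (category_name,),
    )
    return [row[0] for row in rows]

print(products_in(conn, "shoe"))
# ['Marathon']
```

The Marathon shoe is filed only under "running shoe", yet it is found when browsing "shoe" because the query walks down the supercategory_id links.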
Keep them all in one table and have a type field. The reason to do it this way is so that your data structure is scalable: i.e. if there is a new type of product then instead of adding a new table and having to drastically change your application code, you just use the same table and simply add a type.
If you don't want to make it too complex, you can keep all of them in the same table and create another table called "ProductType" that tells you if it is a shoe or a t-shirt.
The relationship will be one-to-many from the "ProductType" side, as you can have the same type of product associated with more than one record in the product table (where you store all your products).
