I am developing a tool which may have more than a million rows of data to fill in.
Currently I have designed a single table with 36 columns. My question is: do I need to divide these into multiple tables, or keep a single one?
If single, what are the advantages and disadvantages?
If multiple, what are the advantages and disadvantages?
And what engine should I use for speed?
My concern is a large database which will have at least 50,000 queries per day.
Any help?
Yes, you should normalize your database. A general rule of thumb is that if a column that isn't a foreign key contains duplicate values, the table should be normalized.
Normalization involves splitting your database into tables, and helps to:
Avoid modification anomalies.
Minimize impact of changes to the data structure.
Make the data model more informative.
There is plenty of information about normalization on Wikipedia.
If you have a serious amount of data and don't normalize, you will eventually come to a point where you will need to redesign your database, and this is incredibly hard to do retrospectively, as it will involve not only changing any code that accesses the database, but also migrating all existing data to the new design.
There are cases where it might be better to avoid normalization for performance reasons, but you should have a good understanding of normalization before making this decision.
First and foremost, ask yourself: are you repeating fields, or attributes of fields? Does your one table contain relationships or attributes that should be separated? Follow third normal form. We need more info to help, but generally speaking, one table with thirty-six columns smells like a db fart.
If you want to store a million rows of the same kind, go for it. Any decent database will cope even with much bigger tables.
Design your database to best fit the data (as seen from your application), get it up, and optimize later. You will probably find that performance is not a problem.
You should model your database according to the data you want to store. This is called "normalization": essentially, each piece of information should only be stored once; otherwise, a table cell should point to another row or table containing the value. If, for example, you have a table containing phone numbers, and one column contains the area code, you will likely have more than one phone number with the same value in that column. Once this happens, you should set up a new table for area codes and link to its entries by referencing the primary key of the row the desired area code is stored in.
So instead of
id | area code | number
---+-----------+---------
1 | 510 | 555-1234
2 | 510 | 555-1235
3 | 215 | 555-1236
4 | 215 | 555-1237
you would have
area codes:

id | area code
---+-----------
1  | 510
2  | 215

numbers:

id | number   | area code
---+----------+-----------
1  | 555-1234 | 1
2  | 555-1235 | 1
3  | 555-1236 | 2
4  | 555-1237 | 2
The more occurrences of the same value you have, the more likely you are to save memory and get quicker performance by organizing your data this way, especially when you're handling string values or binary data. Also, if an area code were to change, all you need to do is update a single cell instead of having to perform an update operation on the whole table.
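In SQL, that normalized layout might look like this (a minimal sketch; table and column names are illustrative):

CREATE TABLE area_codes (
    id        INTEGER PRIMARY KEY,
    area_code VARCHAR(3) NOT NULL
);

CREATE TABLE phone_numbers (
    id           INTEGER PRIMARY KEY,
    number       VARCHAR(8) NOT NULL,
    area_code_id INTEGER NOT NULL REFERENCES area_codes (id)
);

-- Reassembling the original rows is a single join:
SELECT n.id, a.area_code, n.number
FROM phone_numbers n
JOIN area_codes a ON a.id = n.area_code_id;

-- And changing an area code now touches exactly one row:
UPDATE area_codes SET area_code = '341' WHERE area_code = '510';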
Try this tutorial.
Correlation does not imply causation.
Just because a shitload of columns usually indicates a bad design doesn't mean that a shitload of columns is always a bad design.
If you have a normalized model, you store whatever number of columns you need in a single table.
It depends!
Does that one table contain a single 'entity'? I.e., are all 36 columns attributes of a single thing, or are there several 'things' mixed together?
If mixed, then you should normalise (separate into distinct entities with relationships between them). You should aim for at least Third Normal Form (3NF).
A best practice is to normalise as much as you can; if you later identify a performance problem, then denormalise as little as you can.
The setup.
I have a table that stores a list of physical items for a game. Items also have a hierarchical list of categories. Example base table:
Items
id | parent_id | is_category | name | description
-- | --------- | ----------- | ------- | -----------
1 | 0 | 1 | Weapon | Something intended to cause damage
2 | 1 | 1 | Ranged | Attack from a distance
3 | 1 | 1 | Melee | Must be able to reach the target with arm
4 | 2 | 0 | Musket | Shoots hot lead.
5 | 2 | 0 | Bomb | Fire damage over area
6 | 0 | 1 | Mount | Something that carries a load.
7 | 6 | 0 | Horse | It has no name.
8 | 6 | 0 | Donkey | Don't assume or you become one.
The system is currently running on PHP and SQLite, but the database back-end is flexible and may use MySQL, and the front-end may eventually use JavaScript or Objective-C/Swift.
The problem.
In the sample above, the program must have different special handling for each of the top level categories and the items underneath them; e.g. Weapon and Mount are sold by different merchants, and weapons may be carried while a mount cannot.
What is the best way to flag the top level tiers in code for special handling?
While the top level categories are relatively fixed I would like to keep them in the DB so it is easier to generate the full hierarchy for visualization using a single (recursive) function.
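(For reference, the single recursive retrieval I mean would be something like this sketch against the sample table; WITH RECURSIVE is available in SQLite 3.8.3+ and MySQL 8+:)

WITH RECURSIVE tree (id, name, depth) AS (
    SELECT id, name, 0 FROM Items WHERE parent_id = 0
    UNION ALL
    SELECT i.id, i.name, t.depth + 1
    FROM Items i
    JOIN tree t ON i.parent_id = t.id
)
SELECT * FROM tree;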
Nearly all foreign keys that identify an item may also identify an item category so separating them into different tables seemed very clunky.
My thoughts.
I can use a string match on the name and store the id in an internal constant upon first execution. An ugly solution at best that I would like to avoid.
I can store the id in an internal constant at install time. Better, but still not quite what I prefer.
I can store an array in code of the top level elements instead of putting them in the table. This creates a lot of complications, like how a child points to the top-level parent. Another id would have to be added to the table, one used by maybe 100 of the 10K rows.
I can store an array in code and enable identity insert at install time to add the top level elements, sharing the identity of the static array. Probably my best idea, but I don't really like identity insert; it just doesn't feel "database" to me. Also, what if a new top level item appears? Maybe start the ids at 1 million for these categories?
I can add a flag column "varchar(1) top_category" or "int top_category" with a character or bit-map indicating the value. Again a column used on like 10 of 10k rows.
As a software person I tend to find software solutions, so I'm curious if there is a more DB-type solution out there.
Original table, with a join to actions.
Yes, you can put everything in a single table. You'd just need to establish unique rows for every scenario. This sqlfiddle gives you an example... but IMO it starts to become difficult to make sense of. It doesn't take care of all scenarios, due to not being able to do full joins (just a limitation of sqlfiddle, which is awesome otherwise).
IMO, breaking things out into tables makes more sense. Here's another example of how I'd start to approach a schema design for some of the scenarios you described.
The base tables themselves look clunky, but they give you so much more flexibility in how the data is used.
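As a rough illustration of the direction (this is my own sketch, not the linked example; all names are made up):

-- Categories get their own table; top level = no parent
CREATE TABLE categories (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES categories (id),
    name      TEXT NOT NULL
);

-- Items only ever point at a category
CREATE TABLE items (
    id          INTEGER PRIMARY KEY,
    category_id INTEGER NOT NULL REFERENCES categories (id),
    name        TEXT NOT NULL,
    description TEXT
);

-- The special top-level tiers are now just the rows with no parent:
SELECT * FROM categories WHERE parent_id IS NULL;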
tl;dr analogy ahead
A database isn't a list of outfits organized in rows. It's where you store the clothes that make up an outfit.
So the clunky feel of breaking things out into separate tables is actually the benefit of relational databases. Putting everything into a single table feels efficient and optimized at first... but as you expand complexity... it starts to become a pain.
Think of your schema as a dresser. Drawers are your tables. If you only have a few socks and a little underwear, putting them all in one drawer is efficient. But once you get enough socks, it can become a pain to have them all in the same drawer as your underwear. You have dress socks, crew socks, ankle socks, furry socks. So you put them in another drawer. Once you have shirts, shorts, and pants, you start putting them in drawers too.
The urge to put all data into a single table is often driven by how you intend to use the data.
Assuming your dresser is fully stocked and neatly organized, you have several potential unique outfits, all neatly organized in your dresser. You just need to put them together. Selects and joins are how you would assemble those outfits. The fact that your favorite jeans/t-shirt/sock combo isn't all in one drawer doesn't make it clunky or inefficient. The fact that they are separated and organized allows you to:
1. Quickly know where to get each item
2. See potential other new favorite combos
3. Quickly see what you have of each component of your outfit
There's nothing wrong with choosing to think of the outfit first, and how you will put it away later. If you only have one outfit, putting everything in one drawer is way easier than putting each piece in a separate drawer. However, as you expand your wardrobe, the single drawer for everything starts to become inefficient.
You typically want to plan for expansion and versatility. Your program can put the data together however you need it. A well organized schema can do that for you. Whether you use an ORM and do model-driven data storage, or start with the schema and then build models based on it, the more complex your data requirements become, the more similar both approaches become.
A relational database is meant to store entities in tables that relate to each other. Very often you'll see examples of a company database consisting of departments, employees, jobs, etc. or of stores holding products, clients, orders, and suppliers.
It is very easy to query such a database and, for example, get all employees that have a certain job in a particular department:
select *
from employees
where job_id = (select id from jobs where name = 'accountant')
and dept_id = (select id from departments where name = 'buying');
You, on the other hand, have only one table containing "things". One row can relate to another, meaning "is of type". You could call this table "something". And were it about company data, we would get the job thus:
select *
from something
where description = 'accountant'
and parent_id = (select id from something where description = 'job');
and the department thus:
select *
from something
where description = 'buying'
and parent_id = (select id from something where description = 'department');
These two would still have to be related by persons working in a department in a job. A mere "is type of" doesn't suffice then. The short query I've shown above would become quite big and complex with your type of database. Imagine the same with a more complicated query.
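For illustration, here is roughly what even the simple lookup above could turn into against a generic "something" table (the person_something link table is my own invention for this sketch):

select p.*
from person p
join person_something pj on pj.person_id = p.id
join something job on job.id = pj.something_id
where job.description = 'accountant'
  and job.parent_id = (select id from something where description = 'job')
  and exists (
      select 1
      from person_something pd
      join something dept on dept.id = pd.something_id
      where pd.person_id = p.id
        and dept.description = 'buying'
        and dept.parent_id = (select id from something where description = 'department'));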
And your app would either know nothing about what it's selecting (well, it would know it's a something of some type, and another something of some type, and that the person (if you go so far as to introduce a person table) is somehow connected with these two things), or it would have to know what the description "department" means and what the description "job" means.
Your database is blind. It doesn't know what a "something" is. If you make a programming mistake some time (most of us do), you may even store wrong relations (a Donkey is of type Musket and hence "shoots hot lead" while you can ride it) and your app may crash at one point or another, unable to deal with a query result.
Don't you want your app to know what a weapon is and what a mount is? That a weapon enables you to fight and a mount enables you to travel? So why make this a secret? Do you think you gain flexibility? Well, then add food to your table without altering the app. What will the app do with this information? You see, you must code this anyway.
Separate entity from data. Your entities are weapons and mounts so far. These should be tables. Then you have instances (rows) of these entities that have certain attributes. A bomb is a weapon with a certain range for instance.
Tables could look like this:
person (person_id, name, strength_points, ...)
weapon (weapon_id, name, range_from, range_to, weight, force_points, ...)
person_weapon(person_id, weapon_id)
mount (mount_id, name, speed, endurance, ...)
person_mount(person_id, mount_id)
food (food_id, name, weight, energy_points, ...)
person_food (person_id, food_id)
armor (armor_id, name, protection_points, ...)
person_armor <= a table for m:n or a mere person.id_armor for 1:n
...
This is just an example, but it shows clearly what entities your app is dealing with. It knows weapons and food are something the person carries, so these can only have a maximum total weight for a person. A mount is something to use for transport and can make a person move faster (or carry weight, if your app and tables allow for that). Etc.
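To put a couple of those into concrete DDL (again just a sketch; column types are my assumptions):

CREATE TABLE person (
    person_id       INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    strength_points INTEGER
);

CREATE TABLE weapon (
    weapon_id    INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    range_from   INTEGER,
    range_to     INTEGER,
    weight       INTEGER,
    force_points INTEGER
);

-- m:n: which person carries which weapons
CREATE TABLE person_weapon (
    person_id INTEGER NOT NULL REFERENCES person (person_id),
    weapon_id INTEGER NOT NULL REFERENCES weapon (weapon_id),
    PRIMARY KEY (person_id, weapon_id)
);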
This is a theoretical question which I ask due to a request that has come my way recently. I support a master operational data store which maintains a set of data tables (with the master data), along with a set of lookup tables (which contain a list of reference codes along with their descriptions). There has recently been a push from the downstream applications to unite the two structures (data and lookup values) logically in the presentation layer, so that it is easier for them to find out if there have been updates in the overall data.
While the request is understandable, my first thought is that it should be implemented at the interface level rather than at the source. Combining the two tables logically (last_update_date) at the ODS level amounts to de-normalizing the data and seems contrary to the idea of keeping lookups and data separate.
That said, I cannot think of any reason why it should not be done at the ODS level, apart from the fact that it does not "seem" right... Does anyone have any thoughts on why such an approach should or should not be followed?
I am listing an example here for simplicity's sake.
Data table

ID | Name | Emp_typ_cd | Last_update_date
---+------+------------+-----------------
1  | X    | E1         | 2014-08-01
2  | Y    | E2         | 2014-08-01

Code table

Emp_typ_cd | Emp_typ_desc | Last_update_date
-----------+--------------+-----------------
E1         | Employee_1   | 2014-08-23
E2         | Employee_2   | 2013-09-01
The downstream request is to represent the data as
Data view

ID | Name | Emp_typ_cd | Last_update_date
---+------+------------+-----------------
1  | X    | E1         | 2014-08-23
2  | Y    | E2         | 2014-08-01

or

Data view

ID | Name | Emp_typ_cd | Emp_typ_desc | Last_update_date
---+------+------------+--------------+-----------------
1  | X    | E1         | Employee_1   | 2014-08-23
2  | Y    | E2         | Employee_2   | 2014-08-01
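(For what it's worth, I imagine the second representation could be served at the interface level with a simple view along these lines; data_table and code_table are placeholder names:)

CREATE VIEW data_view AS
SELECT d.ID,
       d.Name,
       d.Emp_typ_cd,
       c.Emp_typ_desc,
       CASE WHEN c.Last_update_date > d.Last_update_date
            THEN c.Last_update_date
            ELSE d.Last_update_date
       END AS Last_update_date
FROM data_table d
JOIN code_table c ON c.Emp_typ_cd = d.Emp_typ_cd;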
You are correct, it is denormalizing the database because someone wants to see the data in a certain way. The side effects, as you know, are that you are duplicating data, reducing flexibility, increasing table size, storing dissimilar objects together, etc. You are also correct that their problem should be solved somewhere or somehow else. They won’t get what they want if they change the database the way they want to change it. If they want to make it “easier for them to find out if there have been updates in the overall data” but they duplicate massive amounts of it, they’re just opening themselves up to errors. In your example, the Emp_typ_cd’s last_update_date must be updated for all employees with that emp type code. A good update statement will do that, but still, instead of updating a single row in the lookup table you’re updating every single employee that has that emp type.
We use lookup tables all the time. We can add a new value to a lookup table, add employees to the database with a fk to that new attribute, and any report that joins on that table now has the ID, Value, Sort Order, etc. Let’s say we add ‘Veteran’ to lu_Work_Experience. We add an employee with the veteran fk_Id, and now any existing query that joins on lu_Work_Experience has that value. It can sort Work Experience alphabetically or by our pre-defined sort.
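A hypothetical illustration of that workflow (lu_Work_Experience is from above; the other names and values are assumed):

-- Add the new lookup value once:
INSERT INTO lu_Work_Experience (Id, Value, SortOrder)
VALUES (7, 'Veteran', 70);

-- New employees simply reference it:
INSERT INTO Employees (Name, fk_Work_Experience_Id)
VALUES ('Jane Doe', 7);

-- Every report that joins on the lookup picks the value up for free:
SELECT e.Name, lu.Value AS Work_Experience
FROM Employees e
JOIN lu_Work_Experience lu ON lu.Id = e.fk_Work_Experience_Id
ORDER BY lu.SortOrder;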
There is a valid reason for flattening your data structure though, and that is speed. If you’re running a very large report, it will be faster with no joins (and good indexing). If the business knows it’s going to run a very large report many times and is worried about end-user wait times, then it is a good idea to build a single table for that one report. We do it all the time for calculated measures. If we know that a certain analytic report will have a ton of aggregation and joins, we pre-aggregate the data into the data store. That being said, we don’t do that very often in SQL because we use cubes for analytics.
So why use lookup tables in a database? Logical separation of data: an employee has an employee code, but it does NOT have a date of when an employee code was updated. Reduced duplicate data. Minimized design complexity. And you avoid building a table for a specific report and then having to build a different table for a different report even if it has similar data.
Anyway, the rest of my argument would just be facts from the Database Normalization Wikipedia page, so I’ll skip it.
After using Hibernate enum mapping again today, I was wondering whether it would be a good idea to stretch it a little bit more.
To explain where my thought came from: of course we aim for a normalized data model for our applications, which often means that we get a lot of tables that contain something like a category, a state, or similar data. Usually these tables have very few columns (often only a PK and 1 or 2 content columns) and rows. Also, the content of those tables changes very rarely, sometimes never.
If we used an enum for that and mapped it to the table by ordinal or by an integer (both possible with Hibernate, but I'd say any ORM can do that), wouldn't that be better in both performance (fewer joins) and handling (enums in Java can be used very elegantly)?
To clarify things a bit:
Table PERSONS
ID: Number
NAME: Varchar
RELATIONSHIP_STATUS_ID: Number
Table RELATIONSHIP_STATUS
ID: Number
STATUS: Varchar
Content PERSONS:
1 | John Doe | 1
2 | Mary Poppins | 2
Content RELATIONSHIP_STATUS
1 | Single
2 | Married
Now I'd dump the status table, have those two statuses in an enum, and map that to the column by ordinal.
Would this be a sensible thing to do?
I would especially be interested in whether this kind of design is better performance-wise.
My factors for choosing between a table and an enum are the following:
the list of possible values could change in the future, and we don't want to recompile, retest and redeploy the app when it happens: we use a table
the list of possible values could change in the future, but every value of the table is used in the code itself to implement some business logic (like: if status == married then do one thing, else do another): we'll need to change the logic anyway if the list of possible values changes, so we use an enum
the list will never, ever change: we use an enum
You can still keep the table and use an enum in the code, though. This makes things clearer when just looking at the data in the database, when you don't know how the enum is implemented. 0 meaning married and 1 meaning single is not obvious at all. If you keep the table just for reference, you can at least figure out what the values mean, and (with a foreign key) make sure that it's not possible to insert 2 or any other number in the data.
Another way is to use the name of the enum rather than its ordinal. It takes up a bit more space and is a bit less efficient, but it makes the data even clearer and simpler to analyze. You lose the safety, though, unless you add a check constraint.
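For example, a sketch of the name-based mapping with a check constraint (reusing the PERSONS table from the question; exact syntax varies by engine):

CREATE TABLE PERSONS (
    ID                  INTEGER PRIMARY KEY,
    NAME                VARCHAR(100),
    RELATIONSHIP_STATUS VARCHAR(20) NOT NULL
        CHECK (RELATIONSHIP_STATUS IN ('SINGLE', 'MARRIED'))
);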
The users I am concerned with can either be "unconfirmed" or "confirmed". The latter means they get full access, where the former means they are pending on approval from a moderator. I am unsure how to design the database to account for this structure.
One thought I had was to have 2 different tables: confirmedUser and unconfirmedUser that are pretty similar except that unconfirmedUser has extra fields (such as "emailConfirmed" or "confirmationCode"). This is slightly impractical as I have to copy over all the info when a user does get accepted (although I imagine it won't be that bad - not expecting heavy traffic).
The second way I imagined this would be to actually put all the users in the same table and have a key towards a table with the extra "unconfirmed" data if need be (perhaps also add a "confirmed" flag in the user table).
What are the advantages and disadvantages of each approach, and is there perhaps a better way to design the database?
The first approach means you'll need to write every query you have for two tables - for everything that's common. Bad (tm). The second option is definitely better. That way you can add a simple where confirmed = True (or False) as required for specific access.
What you could actually ponder is whether or not the confirmation data (not the user, just the data) is stored in the same table. Perhaps it would be cleaner and more normalized to have all confirmation data in a separate table, so you left join confirmation on confirmation.userid = users.id where confirmation.userid is not null (or similar, or inner join, or get all + filter in server-side script, etc.) to get only confirmed users. The additional data like confirmation email, date, etc. can be stored there.
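A sketch of that layout (all names are illustrative):

CREATE TABLE users (
    id        INTEGER PRIMARY KEY,
    email     TEXT NOT NULL,
    confirmed BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE TABLE confirmations (
    user_id           INTEGER PRIMARY KEY REFERENCES users (id),
    confirmation_code TEXT NOT NULL,
    requested_at      TIMESTAMP,
    confirmed_at      TIMESTAMP
);

-- Confirmed users via the flag:
SELECT * FROM users WHERE confirmed = TRUE;

-- Or via the confirmation data itself:
SELECT u.*
FROM users u
JOIN confirmations c ON c.user_id = u.id
WHERE c.confirmed_at IS NOT NULL;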
Personally I would go for your second option: 1 users table with a confirmed/pending column of type boolean. Copying over data from one table to another identical table is impractical.
You can then create groups and attach specific access rights to each group and assign each user to a specific group if the need arises.
Logically, this is inheritance (aka. category, subclassing, subtype, generalization hierarchy etc.).
Physically, inheritance can be implemented in 3 ways, as mentioned here, here, here and probably in many other places on SO.
In this particular case, the strategy with all types in the same table seems most appropriate [1], since the hierarchy is simple and unlikely to gain new subclasses, subclasses differ by only a few fields, and you need to maintain the parent-level key (i.e. unconfirmed and confirmed users should not have overlapping keys).
[1] I.e. the "second way" mentioned in your question. Whether to also put the confirmation data in the same table depends on the needed cardinality - i.e. is there a 1:N relationship there?
The best way to do this is to have a table for the users with a status ID as a foreign key; the status table would have all the different types of confirmations, all the different combinations that you could have. This is, in my opinion, the best way to structure the database for normalization and for your programming needs.
So your status table would look like this:
StatusID | Description
=============================================
1 | confirmed
2 | unconfirmed
3 | CC confirmed
4 | CC unconfirmed
5 | acct confirmed CC unconfirmed
6 | all confirmed
user table
userID | StatusID
=================
456 | 1
457 | 2
458 | 2
459 | 1
If you have a need for the confirmation code, you can store that inside the user table, and program it to change after it is used, so that you can use that same field if they need to reset a password or whatever.
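In DDL the idea would look something like this (a sketch; types and the column names are my assumptions):

CREATE TABLE Status (
    StatusID    INTEGER PRIMARY KEY,
    Description VARCHAR(50) NOT NULL
);

CREATE TABLE Users (
    UserID           INTEGER PRIMARY KEY,
    StatusID         INTEGER NOT NULL REFERENCES Status (StatusID),
    ConfirmationCode VARCHAR(32)  -- reusable for password resets, per above
);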
Maybe I am assuming too much?
I would like to ask someone who has experience in database design. This is my idea, and I can't assess the deeper consequences of such an approach to, let's say, a common problem. I appreciate your comments in advance...
Imagine:
- patients in hospital
- each patient should have:
1. personal data - Name, Surname, Street, SecurityID, contact, and many more (which could be changed over time)
2. personal records - a heap of various forms (also changing over time)
Typically I would design a table for the patient's personal data:
personaldata_tbl
| ID | SecurityID | Name | Surname ... | TimeOfEntry
and similar tables for each form in the program. This could be a very hard task, because it could reach several hundred such tables. In addition, their count will probably keep growing.
And yes, all of them should be relationally connected for example:
releaseform_tbl
| ID | personaldata_tbl_ID | DateOfRelease | CauseOfRelease ... | TimeOfEntry
My intention is to reduce these 2D tables to a single 1D table - all data about patients would be stored in one table! Other tables will describe (referentially) what kind of data is stored in the main table. Look at this:
data_info_tbl
| ID | Description |
| 1 | Name |
| 2 | Surname |
patient_data_tbl
| ID | patient_ID | data_info_ID | form_ID | TimeOfEntry | Value
| 1 | 121 | 1 | 7 | 17.12.2011 14:34 | John
| 2 | 121 | 2 | 7 | 17.12.2011 14:34 | Smith
The main reasons why this approach attracts me are:
- simplicity
- ability to store any data with appropriate specification and precision
- no table jungle
Cons:
- SQL querying could be problematic in some cases
- there would have to be a reliable algorithm to delete, update, and insert data (one way is to dynamically create a table, perform the operations on it, and finally store it back)
- data-aware controls can't be used.
So what would you say?
Thanks for your time and answers.
The most obvious problems . . .
You lose control of size. The "Value" column must be big enough to hold the largest type you use, which in the general case must be a blob. (X-ray images, in a hospital database.)
You lose data types. PostgreSQL, for example, includes the data types "point", bit string, internet address, cidr address, MAC address, and UUID. Storing all values in a column of a single type means you lose all the type-safety built into the specific data types.
You lose constraints. Some integers need to be constrained to between 1 and 10, others between 1000 and 3000. Some text strings need to be all numbers (ZIP codes), some need to be a particular mix of alpha and numerics (tire sizes).
You lose scalability. If there are 1000 attributes in a person's medical records, each person's data will take 1000 rows in the table. If you have 100,000 patients--an easily manageable number even in Microsoft Access and SQLite--your table suddenly balloons from a manageable 100,000 rows to 100,000,000 rows. Any query that does a table scan will have to scan 100 million rows, every time. Any single query that needs to return, say, 30 attributes will need 30 joins.
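To make the last point concrete, here is a hypothetical sketch of pulling just three attributes for one patient out of the proposed table (data_info_ID 3 for SecurityID is my assumption; now imagine 30 of these joins):

SELECT p1.Value AS name,
       p2.Value AS surname,
       p3.Value AS security_id
FROM patient_data_tbl p1
JOIN patient_data_tbl p2
  ON p2.patient_ID = p1.patient_ID AND p2.data_info_ID = 2  -- Surname
JOIN patient_data_tbl p3
  ON p3.patient_ID = p1.patient_ID AND p3.data_info_ID = 3  -- SecurityID
WHERE p1.data_info_ID = 1  -- Name
  AND p1.patient_ID = 121;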
What you're proposing is the EAV anti-pattern. (Starts on slide 30.)
I disagree with Bert Evans (in the sense that I don't find this terribly valid).
First of all it's not clear to me what problem you are trying to solve. The three "benefits" you list:
simplicity
ability to store any data with appropriate specification and precision
no table jungle
don't really make a lot of sense if the application is small, and if it isn't (like the hospital records you mention in your example), any possible "gain" is lost when you start to factor in that this will make any sort of query very inefficient, and that human operators trying to design reports or data extractions, or to extend the DB, will have to put in a lot of extra effort.
Example: I suppose your hospital patient has an address and therefore a ZIP code... have you considered what hoops you will have to jump through to create a foreign key to the zip code/state table?
Another example: as soon as you realize that the patient may have a middle name, and that on the form it will be placed between the first and last name, what will you do? Renumber all the last-name fields? Or place the middle name at the bottom of the pile, so that your form will have to add special logic to show it in the "correct" position?
You may want to check some alternatives to SQL DBs, like XML-based data stores, or even MUMPS, but I really can't see any benefit in the approach you are proposing (and please consider that I have seen an over-zealous DBA try to do something very similar when designing a web application backed by an Oracle DB: every field/label/image on the webpage had just a numeric reference to a sequence-based ID record in the DB, making the whole webapp a nightmare to maintain - so I am not just being a "purist" here).