After using Hibernate enum mapping again today, I was wondering whether it would be a good idea to stretch it a little bit more.
To explain where my thought came from: of course we aim for a normalized data model for our applications, which often means we get a lot of tables that contain something like a category, a state or similar data. Usually these tables have very few columns (often only a PK and 1 or 2 content columns) and few rows. Also, the content of those tables changes very rarely, sometimes never.
If we used an enum for that and mapped it by ordinal or by an integer (both work with Hibernate, but I'd say any ORM can do that), wouldn't that be better in both performance (fewer joins) and handling (enums in Java can be used very elegantly)?
To clarify things a bit:
Table PERSONS
ID: Number
NAME: Varchar
RELATIONSHIP_STATUS_ID: Number
Table RELATIONSHIP_STATUS
ID: Number
STATUS: Varchar
Content PERSONS:
1 | John Doe | 1
2 | Mary Poppins | 2
Content RELATIONSHIP_STATUS
1 | Single
2 | Married
Now I'd drop the status table, put those two statuses in an enum and map that to the column by ordinal.
Would this be a sensible thing to do?
I especially would be interested if this kind of design would be better performance-wise.
My factors for choosing between a table and an enum are the following:
the list of possible values could change in the future, and we don't want to recompile, retest and redeploy the app when it happens: we use a table
the list of possible values could change in the future, but every value of the table is used in the code itself to implement some business logic (like if status == married then do something else do something): we'll need to change the logic anyway if the list of possible values changes, so we use an enum
the list will never, ever change: we use an enum
You can still keep the table, and use an enum in the code, though. This makes it clearer when just looking at the data in the database, when you don't know how the enum is implemented. 0 meaning married and 1 meaning single is not obvious at all. If you keep the table just for reference, you can at least figure what the values mean, and make sure that it's not possible to insert 2 or any other number in the data.
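As a rough illustration of keeping the reference table while still using an enum in code, here is a minimal sketch in Python with an in-memory SQLite database (chosen only so the example runs on its own; the thread itself is about Java and Hibernate). The enum carries explicit IDs instead of relying on ordinals, the reference table is seeded from the enum, and a foreign key rejects any other number. The lower-case table names and the helper functions `build_db` and `add_person` are assumptions made for this example.

```python
import sqlite3
from enum import IntEnum


class RelationshipStatus(IntEnum):
    # Explicit IDs that mirror the rows of RELATIONSHIP_STATUS in the example.
    SINGLE = 1
    MARRIED = 2


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE relationship_status (
            id     INTEGER PRIMARY KEY,
            status TEXT NOT NULL UNIQUE
        );
        CREATE TABLE persons (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            relationship_status_id INTEGER NOT NULL
                REFERENCES relationship_status (id)
        );
        """
    )
    # Seed the reference table from the enum so the two cannot drift apart.
    conn.executemany(
        "INSERT INTO relationship_status (id, status) VALUES (?, ?)",
        [(s.value, s.name.capitalize()) for s in RelationshipStatus],
    )
    return conn


def add_person(conn: sqlite3.Connection, name: str, status: int) -> None:
    conn.execute(
        "INSERT INTO persons (name, relationship_status_id) VALUES (?, ?)",
        (name, int(status)),
    )
```

With this layout, someone looking only at the database can still join to relationship_status to see what each number means, and an insert with status 3 fails with an integrity error.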
Another way is to use the name of the enum rather than its ordinal. It takes up a bit more space and is a bit less efficient, but it makes the data even clearer and simpler to analyze. You lose the safety, though, unless you add a check constraint.
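A minimal sketch of the name-based variant, again in Python with SQLite purely so it can be run standalone. The column stores the enum name as text, and a CHECK constraint generated from the enum restores the safety. The column name `relationship_status` and the helpers `build_db` and `load_status` are assumptions for illustration only.

```python
import sqlite3
from enum import Enum


class RelationshipStatus(Enum):
    SINGLE = "SINGLE"
    MARRIED = "MARRIED"


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Build the list of allowed values from the enum itself.
    allowed = ", ".join(f"'{s.name}'" for s in RelationshipStatus)
    conn.execute(
        f"""
        CREATE TABLE persons (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            relationship_status TEXT NOT NULL
                CHECK (relationship_status IN ({allowed}))
        )
        """
    )
    return conn


def load_status(conn: sqlite3.Connection, person_id: int) -> RelationshipStatus:
    # Look the stored name up in the enum; unknown names raise KeyError.
    (raw,) = conn.execute(
        "SELECT relationship_status FROM persons WHERE id = ?", (person_id,)
    ).fetchone()
    return RelationshipStatus[raw]
```

The trade-off is that adding an enum value now also means altering the constraint, which is roughly the same maintenance cost as adding a row to a reference table.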
The setup.
I have a table that stores a list of physical items for a game. Items also have a hierarchical list of categories. Example base table:
Items
id | parent_id | is_category | name | description
-- | --------- | ----------- | ------- | -----------
1 | 0 | 1 | Weapon | Something intended to cause damage
2 | 1 | 1 | Ranged | Attack from a distance
3 | 1 | 1 | Melee | Must be able to reach the target with arm
4 | 2 | 0 | Musket | Shoots hot lead.
5 | 2 | 0 | Bomb | Fire damage over area
6 | 0 | 1 | Mount | Something that carries a load.
7 | 6 | 0 | Horse | It has no name.
8 | 6 | 0 | Donkey | Don't assume or you become one.
The system is currently running on PHP and SQLite, but the database back-end is flexible and may use MySQL, and the front-end may eventually use JavaScript or Objective-C/Swift.
The problem.
In the sample above the program must have a different special handling for each of the top level categories and the items underneath them. e.g. Weapon and Mount are sold by different merchants, weapons may be carried while a mount cannot.
What is the best way to flag the top level tiers in code for special handling?
While the top level categories are relatively fixed I would like to keep them in the DB so it is easier to generate the full hierarchy for visualization using a single (recursive) function.
Nearly all foreign keys that identify an item may also identify an item category so separating them into different tables seemed very clunky.
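For reference, the single recursive walk over the hierarchy can also be done inside the database with a recursive common table expression, which SQLite supports. The sketch below uses Python's sqlite3 module only to make it runnable; the same SQL could be issued from PHP. It also carries the top-level category down to every descendant, which is one possible way to decide which special handling applies. The function names and the `top_category` result column are made up for this example, and the description column is left out for brevity.

```python
import sqlite3

# (id, parent_id, is_category, name), taken from the sample table.
ITEMS = [
    (1, 0, 1, "Weapon"),
    (2, 1, 1, "Ranged"),
    (3, 1, 1, "Melee"),
    (4, 2, 0, "Musket"),
    (5, 2, 0, "Bomb"),
    (6, 0, 1, "Mount"),
    (7, 6, 0, "Horse"),
    (8, 6, 0, "Donkey"),
]


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER NOT NULL,"
        " is_category INTEGER NOT NULL,"
        " name TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", ITEMS)
    return conn


def hierarchy(conn: sqlite3.Connection) -> list:
    # Start at the roots (parent_id = 0) and repeatedly join children onto
    # the rows found so far, carrying the root name and depth along.
    return conn.execute(
        """
        WITH RECURSIVE tree (id, name, top_category, depth) AS (
            SELECT id, name, name, 0 FROM items WHERE parent_id = 0
            UNION ALL
            SELECT i.id, i.name, t.top_category, t.depth + 1
            FROM items AS i JOIN tree AS t ON i.parent_id = t.id
        )
        SELECT name, top_category, depth FROM tree ORDER BY id
        """
    ).fetchall()
```

Each returned row is (name, top_category, depth), so a Musket comes back tagged with Weapon at depth 2 without the code needing to know any ids.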
My thoughts.
I can use a string match on the name and store the id in an internal constant upon first execution. An ugly solution at best that I would like to avoid.
I can store the id in an internal constant at install time. Better, but still not quite what I prefer.
I can store an array in code of the top level elements instead of putting them in the table. This creates a lot of complications, like how a child points to its top level parent. Another id would have to be added to the table that is used by something like 100 of the 10K rows.
I can store an array in code and enable identity insert at install time to add the top level elements sharing the identity of the static array. Probably my best idea, but I don't really like identity insert; it just doesn't feel "database" to me. Also, what if a new top level item appears? Maybe start the ids at 1 million for these categories?
I can add a flag column "varchar(1) top_category" or "int top_category" with a character or bit-map indicating the value. Again, a column used on something like 10 of 10K rows.
As a software person I tend to find software solutions, so I'm curious if there is a more DB type solution out there.
Original table, with a join to actions.
Yes, you can put everything in a single table. You'd just need to establish unique rows for every scenario. This sqlfiddle gives you an example... but IMO it starts to become difficult to make sense of. This doesn't take care of all scenarios, due to not being able to do full joins (just a limitation of sqlfiddle that is awesome otherwise.)
IMO, breaking things out into tables makes more sense. Here's another example of how I'd start to approach a schema design for some of the scenarios you described.
The base tables themselves look clunky, but they give so much more flexibility in how the data is used.
tl;dr analogy ahead
A database isn't a list of outfits, organized in rows. It's where you store the clothes that make up an outfit.
So the clunky feel of breaking things out into separate tables is actually the benefit of relational databases. Putting everything into a single table feels efficient and optimized at first... but as you expand complexity... it starts to become a pain.
Think of your schema as a dresser. Drawers are your tables. If you only have a few socks and underwear, putting them all in one drawer is efficient. But once you get enough socks, it can become a pain to have them all in the same drawer as your underwear. You have dress socks, crew socks, ankle socks, furry socks. So you put them in another drawer. Once you have shirts, shorts, pants, you start putting them in drawers too.
The urge to put all data into a single table is often driven by how you intend to use the data.
Assuming your dresser is fully stocked and neatly organized, you have several potential unique outfits, all neatly organized in your dresser. You just need to put them together. Selects and joins are how you would assemble those outfits. The fact that your favorite jean/t-shirt/sock combo isn't all in one drawer doesn't make it clunky or inefficient. The fact that they are separated and organized allows you to:
1. Quickly know where to get each item
2. See potential other new favorite combos
3. Quickly see what you have of each component of your outfit
There's nothing wrong with choosing to think of the outfit first, then how you will put it away later. If you only have one outfit, putting everything in one drawer is way easier than putting each piece in a separate drawer. However, as you expand your wardrobe, the single drawer for everything starts to become inefficient.
You typically want to plan for expansion and versatility. Your program can put the data together however you need it. A well organized schema can do that for you. Whether you use an ORM and do model-driven data storage, or start with the schema and then build models based on it, the more complex your data requirements become, the more similar both approaches become.
A relational database is meant to store entities in tables that relate to each other. Very often you'll see examples of a company database consisting of departments, employees, jobs, etc. or of stores holding products, clients, orders, and suppliers.
It is very easy to query such database and for example get all employees that have a certain job in a particular department:
select *
from employees
where job_id = (select id from job where name = 'accountant')
and dept_id = (select id from departments where name = 'buying');
You on the other hand have only one table containing "things". One row can relate to another meaning "is of type". You could call this table "something". And were it about company data, we would get the job thus:
select *
from something
where description = 'accountant'
and parent_id = (select id from something where description = 'job');
and the department thus:
select *
from something
where description = 'buying'
and parent_id = (select id from something where description = 'department');
These two would still have to be related by persons working in a department in a job. A mere "is type of" doesn't suffice then. The short query I've shown above would become quite big and complex with your type of database. Imagine the same with a more complicated query.
And your app would either not know anything about what it's selecting (well, it would know it's something which is of some type and another something that is of some type and the person (if you go so far as to introduce a person table) is connected somehow with these two things), or it would have to know what description "department" means and what description "job" means.
Your database is blind. It doesn't know what a "something" is. If you make a programming mistake some time (most of us do), you may even store wrong relations (A Donkey is of type Musket and hence "shoots hot lead" while you can ride it) and your app may crash at one point or another not able to deal with a query result.
Don't you want your app to know what a weapon is and what a mount is? That a weapon enables you to fight and a mount enables you to travel? So why make this a secret? Do you think you gain flexibility? Well, then add food to your table without altering the app. What will the app do with this information? You see, you must code this anyway.
Separate entity from data. Your entities are weapons and mounts so far. These should be tables. Then you have instances (rows) of these entities that have certain attributes. A bomb is a weapon with a certain range for instance.
Tables could look like this:
person (person_id, name, strength_points, ...)
weapon (weapon_id, name, range_from, range_to, weight, force_points, ...)
person_weapon(person_id, weapon_id)
mount (mount_id, name, speed, endurance, ...)
person_mount(person_id, mount_id)
food (food_id, name, weight, energy_points, ...)
person_food (person_id, food_id)
armor (armor_id, name, protection_points, ...)
person_armor <= a table for m:n or a mere person.id_armor for 1:n
...
This is just an example, but it shows clearly what entities your app is dealing with. It knows weapons and food are something the person carries, so these can only have a maximum total weight for a person. A mount is something to use for transport and can make a person move faster (or carry weight, if your app and tables allow for that). Etc.
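To make the "maximum total weight" point concrete, here is a small sketch of part of that schema, built with Python and SQLite so it can be tried directly. It creates only the person, weapon and food tables plus their m:n link tables, and adds one query that sums what a person carries. The function names and any weights inserted into it are invented for the example and are not part of the design above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE weapon (weapon_id INTEGER PRIMARY KEY, name TEXT NOT NULL, weight REAL NOT NULL);
CREATE TABLE food   (food_id   INTEGER PRIMARY KEY, name TEXT NOT NULL, weight REAL NOT NULL);
CREATE TABLE person_weapon (
    person_id INTEGER NOT NULL REFERENCES person (person_id),
    weapon_id INTEGER NOT NULL REFERENCES weapon (weapon_id),
    PRIMARY KEY (person_id, weapon_id)
);
CREATE TABLE person_food (
    person_id INTEGER NOT NULL REFERENCES person (person_id),
    food_id   INTEGER NOT NULL REFERENCES food (food_id),
    PRIMARY KEY (person_id, food_id)
);
"""


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def carried_weight(conn: sqlite3.Connection, person_id: int) -> float:
    # Weapons and food are the things a person carries, so only those two
    # entities contribute; mounts are deliberately not part of the sum.
    row = conn.execute(
        """
        SELECT
            (SELECT COALESCE(SUM(w.weight), 0)
               FROM person_weapon pw JOIN weapon w ON w.weapon_id = pw.weapon_id
              WHERE pw.person_id = :pid)
          + (SELECT COALESCE(SUM(f.weight), 0)
               FROM person_food pf JOIN food f ON f.food_id = pf.food_id
              WHERE pf.person_id = :pid)
        """,
        {"pid": person_id},
    ).fetchone()
    return row[0]
```

Because the entities are explicit, the rule "carried things count towards weight" lives in one readable query instead of being inferred from type rows in a generic table.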
I have a SQL Server 2008 database with a snowflake-style schema, so lots of different lookup tables, like Language, Countries, States, Status, etc. All these lookup tables have almost identical structures: two columns, Code and Decode. My project manager would like all of these different tables to be one BIG table, so I would need another column, say CodeCategory, and my primary key columns for this big table would be CodeCategory and Code. The problem is that for any of the tables that have the actual code (say Language Code), I cannot establish a foreign key relationship into this big decode table, as the CodeCategory would not be in the fact table, just the code. And codes by themselves will not be unique (they will be within a CodeCategory), so I cannot make an FK from just the fact table code field into the big lookup table's Code field.
So am I missing something, or is this impossible to do and still be able to do FKs in the related tables? I wish I could do this: have a FK where one of the columns I was matching to in the lookup table would match to a string constant. Like this (I know this is impossible but it gives you an idea what I want to do):
ALTER TABLE [dbo].[Users] WITH CHECK ADD CONSTRAINT [FK_User_AppCodes]
FOREIGN KEY('Language', [LanguageCode])
REFERENCES [dbo].[AppCodes] ([AppCodeCategory], [AppCode])
The above does not work, but if it did I would have the FK I need. Where I have the string 'Language', is there any way in T-SQL to substitute the table name from code instead?
I absolutely need the FKs so, if nothing like this is possible, then I will have to stick with my many little lookup tables. Any assistance would be appreciated.
Brian
It is not impossible to accomplish this, but it is impossible to accomplish this and not hurt the system on several levels.
While a single lookup table (as has been pointed out already) is a truly horrible idea, I will say that this pattern does not require a single field PK or that it be auto-generated. It requires a composite PK comprised of ([AppCodeCategory], [AppCode]) and then BOTH fields need to be present in the fact table that would have a composite FK of both fields back to the PK. Again, this is not an endorsement of this particular end-goal, just a technical note that it is possible to have composite PKs and FKs in other, more appropriate scenarios.
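Purely as a technical illustration of the composite key mechanics described here, and not as a recommendation of the single lookup table, the following sketch shows both fields present in the referencing table. It uses SQLite through Python so that it runs standalone; the question is about SQL Server, where the DDL details would differ. The `LanguageCategory` column, its DEFAULT and CHECK, the `Decode` column name and the sample rows are assumptions added for this example.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE AppCodes (
            AppCodeCategory TEXT NOT NULL,
            AppCode         TEXT NOT NULL,
            Decode          TEXT NOT NULL,
            PRIMARY KEY (AppCodeCategory, AppCode)
        );
        CREATE TABLE Users (
            UserID           INTEGER PRIMARY KEY,
            -- The category must physically exist in the referencing table;
            -- the CHECK pins it to the one category this column may use.
            LanguageCategory TEXT NOT NULL DEFAULT 'Language'
                CHECK (LanguageCategory = 'Language'),
            LanguageCode     TEXT NOT NULL,
            FOREIGN KEY (LanguageCategory, LanguageCode)
                REFERENCES AppCodes (AppCodeCategory, AppCode)
        );
        INSERT INTO AppCodes VALUES
            ('Language', 'EN', 'English'),
            ('Country',  'CA', 'Canada');
        """
    )
    return conn
```

Inserting a user with LanguageCode 'EN' succeeds, while 'CA' is rejected because ('Language', 'CA') does not exist in AppCodes. The price is exactly the extra category column per lookup that the rest of this answer warns about.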
The main problem with this type of approach to constants is that each constant is truly its own thing: Languages, Countries, States, Statuses, etc. are all completely separate entities. While the structure of them in the database is the same (as of today), the data within that structure does not represent the same things. You would be locked into a model that either disallows adding additional lookup fields later (such as ISO codes for Language and Country but not the others, or something related to States that is not applicable to the others), or would require adding NULLable fields with no way to know which Category/ies they applied to (have fun debugging issues related to that and/or explaining to the new person -- who has been there for 2 days and is tasked with writing a new report -- that the 3 digit ISO Country Code does not apply to the "Deleted" status).
This approach also requires that you maintain an arbitrary "Category" field in all related tables. And that is per lookup. So if you have CountryCode, LanguageCode, and StateCode in the fact table, each of those FKs gets a matching CategoryID field, so now that is 6 fields instead of 3. Even if you were able to use TINYINT for CategoryID, if your fact table has even 200 million rows, then those three extra 1 byte fields now take up 600 MB, which adversely affects performance. And let's not forget that backups will take longer and take up more space, but disk is cheap, right? Oh, and if backups take longer, then restores also take longer, right? Oh, but the table has closer to 1 billion rows? Even better ;-).
While this approach looks maybe "cleaner" or "easier" now, it is actually more costly in the long run, especially in terms of wasted developer time, as you (and/or others) in the future try to work around issues related to this poor design choice.
Has anyone even asked your project manager what the intended benefit of this is? It is a reasonable question if you are going to spend some amount of hours making changes to the system that there be a stated benefit for that time spent. It certainly does not make interacting with the data any easier, and in fact will make it harder, especially if you choose a string for the "Category" instead of a TINYINT or maybe SMALLINT.
If your PM still presses for this change, then it should be required, as part of that project, to also change any enums in the app code accordingly so that they match what is in the database. Since the database is having its values munged together, you can accomplish that in C# (assuming your app code is in C#, if not then translate to whatever is appropriate) by setting the enum values explicitly with a pattern of the first X digits are the "category" and the remaining Y digits are the "value". For example:
Assuming the "Country" category == 1 and the "Language" category == 2, you could do:
enum AppCodes
{
    // Countries (category 1); C# identifiers cannot contain spaces
    UnitedStates = 1000001,
    Canada = 1000002,
    SomewhereElse = 1000003,
    // Languages (category 2)
    EnglishUS = 2000001,
    EnglishUK = 2000002,
    French = 2000003
};
Absurd? Completely. But also analogous to the request of merging all lookup tables into a single table. What's good for the goose is good for the gander, right?
Is this being suggested so you can minimise the number of admin screens you need for CRUD operations on your standing data? I've been here before and decided it was better/safer/easier to build a generic screen which used metadata to decide what table to extract from/write to. It was a bit more work to build but kept the database schema 'correct'.
All the standing data tables had the same basic structure, they were mainly for dropdown population with occasional additional fields for business rule purposes.
The users I am concerned with can either be "unconfirmed" or "confirmed". The latter means they get full access, where the former means they are pending on approval from a moderator. I am unsure how to design the database to account for this structure.
One thought I had was to have 2 different tables: confirmedUser and unconfirmedUser that are pretty similar except that unconfirmedUser has extra fields (such as "emailConfirmed" or "confirmationCode"). This is slightly impractical as I have to copy over all the info when a user does get accepted (although I imagine it won't be that bad - not expecting heavy traffic).
The second way I imagined this would be to actually put all the users in the same table and have a key towards a table with the extra "unconfirmed" data if need be (perhaps also add a "confirmed" flag in the user table).
What are the advantages and disadvantages of each approach, and is there perhaps a better way to design the database?
The first approach means you'll need to write every query you have for two tables - for everything that's common. Bad (tm). The second option is definitely better. That way you can add a simple where confirmed = True (or False) as required for specific access.
What you could actually ponder is whether or not the confirmation data (not the user, just the data) is stored in the same table. Perhaps it would be cleaner and more normalized to have all confirmation data in a separate table, so you left join confirmation on confirmation.userid = users.id where confirmation.userid is not null (or similar, or inner join, or get all and filter in a server-side script, etc.) to get only confirmed users. The additional data like confirmation email, date, etc. can be stored here.
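One possible reading of the single-table option, sketched with Python and SQLite so it can be run as-is: all users live in one table with a confirmed flag, and the data that only pending users need sits in a side table keyed by the user id. The table and column names (`user_confirmation`, `confirmation_code`, `email_confirmed`) and the helper functions are assumptions for illustration; the thread does not fix any of them.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE users (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            confirmed INTEGER NOT NULL DEFAULT 0 CHECK (confirmed IN (0, 1))
        );
        CREATE TABLE user_confirmation (
            user_id           INTEGER PRIMARY KEY REFERENCES users (id),
            confirmation_code TEXT NOT NULL,
            email_confirmed   INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def confirmed_users(conn: sqlite3.Connection) -> list:
    # The simple "where confirmed = True" filter on the shared table.
    return [r[0] for r in conn.execute(
        "SELECT name FROM users WHERE confirmed = 1 ORDER BY id")]


def pending_users(conn: sqlite3.Connection) -> list:
    # Pending users together with their confirmation data, if any.
    return conn.execute(
        """
        SELECT u.name, c.confirmation_code
        FROM users u LEFT JOIN user_confirmation c ON c.user_id = u.id
        WHERE u.confirmed = 0
        ORDER BY u.id
        """
    ).fetchall()


def confirm(conn: sqlite3.Connection, user_id: int) -> None:
    # Approval flips the flag and discards the pending-only data;
    # nothing has to be copied between tables.
    conn.execute("UPDATE users SET confirmed = 1 WHERE id = ?", (user_id,))
    conn.execute("DELETE FROM user_confirmation WHERE user_id = ?", (user_id,))
```

Approving a user is then a single update plus an optional cleanup, and every query on common user data is written once.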
Personally I would go for your second option: 1 users table with a confirmed/pending column of type boolean. Copying over data from one table to another identical table is impractical.
You can then create groups and attach specific access rights to each group and assign each user to a specific group if the need arises.
Logically, this is inheritance (aka. category, subclassing, subtype, generalization hierarchy etc.).
Physically, inheritance can be implemented in 3 ways, as mentioned here, here, here and probably in many other places on SO.
In this particular case, the strategy with all types in the same table seems most appropriate [1], since the hierarchy is simple and unlikely to gain new subclasses, subclasses differ by only a few fields, and you need to maintain the parent-level key (i.e. unconfirmed and confirmed users should not have overlapping keys).
[1] I.e. the "second way" mentioned in your question. Whether to also put the confirmation data in the same table depends on the needed cardinality - i.e. is there a 1:N relationship there?
The best way to do this is to have a table for the users with a status ID as a foreign key. The status table would hold all the different confirmation types and all the different combinations that you could have. This is the best way, in my opinion, to structure the database for normalization and for your programming needs.
So your status table would look like this:
StatusID | Description
=============================================
1 | confirmed
2 | unconfirmed
3 | CC confirmed
4 | CC unconfirmed
5 | acct confirmed CC unconfirmed
6 | all confirmed
user table
userID | StatusID
=================
456 | 1
457 | 2
458 | 2
459 | 1
If you have a need for the confirmation code, you can store that inside the user table, and program it to change after it is used, so that you can use that same field if they need to reset a password or whatever.
Maybe I am assuming too much?
I am developing a tool which may have to hold more than a million rows of data.
Currently I have designed a single table with 36 columns. My question is: do I need to divide these into multiple tables, or keep a single one?
If single, what are the advantages and disadvantages?
If multiple, what are the advantages and disadvantages?
And which engine should I use for speed?
My concern is a large database which will serve at least 50000 queries per day.
Any help?
Yes, you should normalize your database. A general rule of thumb is that if a column that isn't a foreign key contains duplicate values, the table should be normalized.
Normalization involves splitting your database into tables, and helps to:
Avoid modification anomalies.
Minimize impact of changes to the data structure.
Make the data model more informative.
There is plenty of information about normalization on Wikipedia.
If you have a serious amount of data and don't normalize, you will eventually come to a point where you will need to redesign your database, and this is incredibly hard to do retrospectively, as it will involve not only changing any code that accesses the database, but also migrating all existing data to the new design.
There are cases where it might be better to avoid normalization for performance reasons, but you should have a good understanding of normalization before making this decision.
First and foremost ask yourself are you repeating fields or attributes of fields. Does your one table contain relationships or attributes that should be separated. Follow third normal form...we need more info to help but generally speaking one table with thirty six columns smells like a db fart.
If you want to store a million rows of the same kind, go for it. Any decent database will cope even with much bigger tables.
Design your database to best fit the data (as seen from your application), get it up, and optimize later. You will probably find that performance is not a problem.
You should model your database according to the data you want to store. This is called "normalization": essentially, each piece of information should only be stored once; otherwise a table cell should point to another row or table containing the value. If, for example, you have a table containing phone numbers, and one column contains the area code, you will likely have more than one phone number with the same value in the same column. Once this happens, you should set up a new table for area codes and link to its entries by referencing the primary key of the row the desired area code is stored in.
So instead of
id | area code | number
---+-----------+---------
1 | 510 | 555-1234
2 | 510 | 555-1235
3 | 215 | 555-1236
4 | 215 | 555-1237
you would have
id | area code
---+----------
1 | 510
2 | 215

id | number | area code
---+----------+-----------
1 | 555-1234 | 1
2 | 555-1235 | 1
3 | 555-1236 | 2
4 | 555-1237 | 2
The more occurrences of the same value you have, the more likely you are to save memory and get quicker performance by organizing your data this way, especially when you're handling string values or binary data. Also, if an area code changes, all you need to do is update a single cell instead of having to perform an update operation on the whole table.
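The area code example can be tried out with a short sketch in Python and SQLite. It builds the two tables, joins them back together, and shows that changing an area code touches one row only. The table names `area_codes` and `phone_numbers`, the helper functions, and the replacement code '341' used in the usage line are made up for the example.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE area_codes (
            id        INTEGER PRIMARY KEY,
            area_code TEXT NOT NULL UNIQUE
        );
        CREATE TABLE phone_numbers (
            id           INTEGER PRIMARY KEY,
            number       TEXT NOT NULL,
            area_code_id INTEGER NOT NULL REFERENCES area_codes (id)
        );
        INSERT INTO area_codes VALUES (1, '510'), (2, '215');
        INSERT INTO phone_numbers VALUES
            (1, '555-1234', 1), (2, '555-1235', 1),
            (3, '555-1236', 2), (4, '555-1237', 2);
        """
    )
    return conn


def full_numbers(conn: sqlite3.Connection) -> list:
    # Reassemble the original one-table view with a join.
    return [
        f"{area_code} {number}"
        for area_code, number in conn.execute(
            """
            SELECT a.area_code, p.number
            FROM phone_numbers p JOIN area_codes a ON a.id = p.area_code_id
            ORDER BY p.id
            """
        )
    ]


def rename_area_code(conn: sqlite3.Connection, old: str, new: str) -> int:
    # One row changes, no matter how many phone numbers use the code.
    cur = conn.execute(
        "UPDATE area_codes SET area_code = ? WHERE area_code = ?", (new, old))
    return cur.rowcount
```

Calling rename_area_code(conn, '510', '341') updates exactly one row, and both numbers that pointed at it pick up the new code on the next join.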
Try this tutorial.
Correlation does not imply causation.
Just because shitloads of columns usually indicate a bad design, doesn't mean that a shitload of columns is a bad design.
If you have a normalized model, you store whatever number of columns you need in a single table.
It depends!
Does that one table contain a single 'entity'? I.e., are all 36 columns attributes of a single thing, or are there several 'things' mixed together?
If mixed, then you should normalise (separate into distinct entities with relationships between them). You should aim for at least Third Normal Form (3NF).
A best practice is to normalise as much as you can; if you later identify a performance problem, then denormalise as little as you can.
This app I'm working on needs to store some meta data fields about an entity. The problem is that we can already foresee that these fields are going to change a lot in the future. Right now every entity's property is translated to one column in the entity table, but altering table columns later down the road will be costly and error-prone right?
Should I go for something like this (key-value store) instead?
MetaDataField
-----
metaDataFieldID (PK), name
FieldValue
----------
EntityID (PK, FK), metaDataFieldID (PK, FK), value [varchar(255)]
P.S. I also thought of using XML on SQL Server 2005+. After talking to some people, it seems it is not a viable solution because it would be too slow for certain reporting queries.
You're right, you don't want to go changing your data schema any time a new parameter comes up!
I've seen two ways of doing something like this. One, just have a "meta" text field, and format the value to define both the parameter and the value. Joomla! does this, for example, to track custom article properties. It looks like this:
ProductTable
id name meta
--------------------------------------------------------------------------
1 prod-a title:'a product title',desc:'a short description'
2 prod-b title:'second product',desc:'n/a'
3 prod-c title:'3rd product',desc:'please choose sm med or large'
Another way of handling this is to use additional tables, like this:
ProductTable
product_id name
-----------------------
1 prod-a
2 prod-b
3 prod-c
MetaParametersTable
meta_id name
--------------------
1 title
2 desc
ProductMetaMapping
product_id meta_id value
-------------------------------------
1 1 a product title
1 2 a short description
2 1 second product
2 2 n/a
3 1 3rd product
3 2 please choose sm med or large
In this case, a query will need to join the tables, but you can optimize the tables better, can query for independent meta without returning all parameters, etc.
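To show what "query for independent meta without returning all parameters" can look like with the three tables, here is a minimal sketch using Python and SQLite with the sample rows from above. The key and foreign key declarations and the helper `meta_value` are assumptions; the answer only gives the column layout.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ProductTable (
            product_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE MetaParametersTable (
            meta_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE ProductMetaMapping (
            product_id INTEGER NOT NULL REFERENCES ProductTable (product_id),
            meta_id    INTEGER NOT NULL REFERENCES MetaParametersTable (meta_id),
            value      TEXT,
            PRIMARY KEY (product_id, meta_id)
        );
        INSERT INTO ProductTable VALUES (1, 'prod-a'), (2, 'prod-b'), (3, 'prod-c');
        INSERT INTO MetaParametersTable VALUES (1, 'title'), (2, 'desc');
        INSERT INTO ProductMetaMapping VALUES
            (1, 1, 'a product title'), (1, 2, 'a short description'),
            (2, 1, 'second product'),  (2, 2, 'n/a'),
            (3, 1, '3rd product'),     (3, 2, 'please choose sm med or large');
        """
    )
    return conn


def meta_value(conn: sqlite3.Connection, product_name: str, meta_name: str):
    # Fetch a single parameter for a single product; returns None when the
    # product has no such parameter.
    row = conn.execute(
        """
        SELECT m.value
        FROM ProductMetaMapping m
        JOIN ProductTable p         ON p.product_id = m.product_id
        JOIN MetaParametersTable mp ON mp.meta_id   = m.meta_id
        WHERE p.name = ? AND mp.name = ?
        """,
        (product_name, meta_name),
    ).fetchone()
    return row[0] if row else None
```

Adding a new kind of metadata is then an insert into MetaParametersTable rather than a schema change, at the cost of two joins per lookup.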
Choosing between them will depend on complexity, whether data rows ever need to have differing meta, and how the data will be consumed.
The Key Value table is a good idea and it works much faster than the SQL Server 2005 XML indexes. I started the same type of solution with XML in a project and had to change it to an indexed Key Value table to gain performance. I think SQL Server 2008 XML indexes are faster, but have not tried them yet.
The XML speed only factors in depending on the size of the data going into the xml column. We had a project that stuffed data into and processed data from an xml column. It was very fast.. until you hit around 64kb. 63KB and less took milliseconds to get the data out or insert into. 64KB and the operations jumped to a full minute. Go figure.
Other than that the main issue we had was complexity. Working with xml data in sql server is not for the faint of heart.
Regardless, your best bet is to have a table of name / value pairs tied to the entity in question. Then it's easy to support having entities with either different properties or dynamically adding / removing properties. This too has its caveats. For example, if you have more than say 10 properties, then it will be much faster to do pivots in code.
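"Pivoting in code" here presumably means fetching the name / value rows as they are and folding them into one record per entity in the application, instead of building a wide pivot query. A minimal sketch of that idea in Python follows; the function name and the (entity_id, name, value) row shape are assumptions.

```python
from collections import defaultdict


def pivot(rows):
    """Fold (entity_id, name, value) rows into {entity_id: {name: value}}."""
    entities = defaultdict(dict)
    for entity_id, name, value in rows:
        entities[entity_id][name] = value
    return dict(entities)


# Rows as they might come back from a plain SELECT on the name/value table.
sample_rows = [
    (1, "title", "a product title"),
    (1, "desc", "a short description"),
    (2, "title", "second product"),
]
records = pivot(sample_rows)
```

Entities with different sets of properties simply end up with different keys, so no NULL padding or per-property join is needed.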
There is also a pattern for this to consider -- called the observation pattern.
See similar questions/answers: one, two, three.
The pattern is described in Martin Fowler's book Analysis Patterns, essentially it is an OO pattern, but can be done in DB schema too.
"altering table columns later down the road will be costly and error-prone right?"
A "table column", as you name it, has exactly two properties : its name and its data type. Therefore, "altering a table column" can refer only to two things : altering the name or altering the data type.
Wanting to alter the name is indeed a costly and error-prone operation, but fortunately there should never be a genuine business need for it. If a certain established column seems somewhat inappropriate in hindsight, and "it might have been given a better name", then it is still not the case that the business incurs losses from that fact! Just stick with the old name, even if, in hindsight, it was poorly chosen.
Wanting to alter the data type is indeed a costly operation, susceptible to breaking business operations that were running smoothly, but fortunately it is quite rare that a user comes round to tell you that "hey, I know I told you this attribute had to be a Date, but guess what, I was wrong, it has to be a Float.". And other changes of the same nature, but more likely to occur (e.g. from shortint to integer or so), can be avoided by being cautious when defining the database.
Other types of database changes (e.g. adding a new column) are usually not that dangerous and/or disruptive.
So don't let yourself be scared by those vague sloganesque phrases such as "changing a database is expensive and dangerous". They usually come from people who know too little about database management to be involved in that particular field of our profession anyway.
Maintaining queries, constraints and constraint enforcement on an EAV database is very likely to turn out to be thousands of times more expensive than "regular" database structure changes.