I have a legacy in-house human resources web app that I'd like to rebuild using more modern technologies. Doctrine 2 is looking good. But I've not been able to find articles or documentation on how best to organise the Entities for a large-ish database (120 tables). Can you help?
My main problem is the Person table (of course! it's an HR system!). It currently has 70 columns. I want to refactor that to extract several subsets into one-to-one sub tables, which will leave me with about 30 columns. There are about 50 other supporting one-to-many tables called person_address, person_medical, person_status, person_travel, person_education, person_profession etc. More will be added later.
If I put all the Doctrine associations in the Person entity class, along with the set/get/add/remove methods for each, the original 30 columns and their methods, and some supporting utility functions, then the Person entity is going to be 1000+ lines long and a nightmare to test.
FWIW I plan to create a PersonRepository to handle the common bulk queries, a PersonProfessionRepository for the bulk queries / reports on that sub table etc, and Person*Service classes which will contain some of the more complex business logic where needed. So organising the rest of the app logic is fine: this is a question about how to correctly organise lots of sub-table Entities with Doctrine that all have relationships / associations back to one primary table. How do I avoid bloating out the Person entity class?

Identifying types of objects
It sounds like you have a nicely normalized database and I suggest you keep it that way. Removing columns from the Person table to create separate tables for one-to-one relations isn't going to help with performance or maintainability.
The fact that you recognize several groups of properties in the Person entity might indicate you have found cases for a Value Object. Even some of the one-to-many tables (like person_address) sound more like Value Objects than Entities.
Starting with Doctrine 2.5 (which is not yet stable at the time of this writing), Doctrine will support embedding single Value Objects. Unfortunately, we will have to wait for a future version for support of collections of Value Objects.
Putting that aside, you can mimic embedding Value Objects; Ross Tuck has blogged about this.
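One common way to mimic an embedded Value Object is to keep the mapped fields flat on the entity and rebuild the Value Object in a getter. The sketch below is in Python rather than PHP and uses no Doctrine API; the Address fields and the move_to method are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """Value Object: immutable, compared by value, no identity of its own."""
    street: str
    city: str
    postcode: str


class Person:
    """Entity: the persisted fields stay flat, as an ORM would map them."""

    def __init__(self, person_id: int, name: str, address: Address) -> None:
        self.id = person_id
        self.name = name
        # Hypothetical flat columns, one per field of the Value Object.
        self._address_street = address.street
        self._address_city = address.city
        self._address_postcode = address.postcode

    @property
    def address(self) -> Address:
        # Rebuild the Value Object from the flat fields on every read.
        return Address(
            self._address_street, self._address_city, self._address_postcode
        )

    def move_to(self, new_address: Address) -> None:
        # Replace the whole value at once instead of exposing three setters.
        self._address_street = new_address.street
        self._address_city = new_address.city
        self._address_postcode = new_address.postcode
```

The entity then exposes one property and one behavioural method per group of columns, instead of a getter and setter per column.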
Lasagna Code
Your plan of implementing an entity, repository, service (and maybe controller?) for Person, PersonProfession, etc sounds like a road to Lasagna Code.
Without extensive knowledge about your domain, I'd say you want to have an aggregate Person, of which the Person entity is the aggregate root. That aggregate needs a single repository. (But maybe I'm off here and being simplistic, as I said, I don't know your domain.)
Creating a service for Person (and other entities / value objects) indicates data-minded thinking. For services it's better to think of behavior. Think of what kind of tasks you want to perform, and group coherent sets of tasks into services. I suspect that for an HR system you'll end up with many services that revolve around your Person aggregate.
Is Doctrine 2 suitable?
I would say: yes. Doctrine itself has no problems with large amounts of tables and large amounts of columns. But performance highly depends on how you use it.
For OLTP systems an ORM can be very helpful. OLTP involves many short transactions, writing a single (or short list) of aggregates to the database.
For OLAP systems an ORM is not suited. OLAP involves many complex analytical queries, usually resulting in large object graphs. For these kinds of operations, native SQL is much more convenient.
Even in case of OLAP systems Doctrine 2 can be of help:
You can use DQL queries (instead of native SQL) to use the power of your mapping metadata. Then use scalar or array hydration to fetch the data.
Doctrine also supports arbitrary joins, which means you can join entities that are not associated with each other according to the mapping metadata.
And you can make use of the NativeQuery object with which you can map the results to whatever you want.
I think an HR system is a perfect example of where you have both OLTP and OLAP. OLTP when it comes to adding a new Person to the system, for example. OLAP when it comes to various reports and analytics.
So there's nothing wrong with using an ORM for transactional operations, while using plain SQL for analytical operations.
Choose wisely
I think the key is to carefully choose when to use what, on a case by case basis.
Hydrating entities is great for transactional operations. Make use of lazy loading associations which can prevent fetching data you're not going to use. But also choose to eager load certain associations (using DQL) where it makes sense.
Use scalar or array hydration when working with large data sets. Data sets usually grow where you're doing analytical operations, where you don't really need full blown entities anyway.
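The contrast between the two paths can be sketched as follows. This is Python with the standard sqlite3 module, not Doctrine; the person table, its department column and the function names are made up for the illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Person:
    id: int
    name: str
    department: str


def connect() -> sqlite3.Connection:
    # In-memory database with a tiny, hypothetical person table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
    )
    conn.executemany(
        "INSERT INTO person (id, name, department) VALUES (?, ?, ?)",
        [(1, "Ann", "Sales"), (2, "Bob", "Sales"), (3, "Cy", "IT")],
    )
    return conn


def load_person(conn: sqlite3.Connection, person_id: int) -> Person:
    # Transactional path: hydrate one full object and work with it.
    row = conn.execute(
        "SELECT id, name, department FROM person WHERE id = ?", (person_id,)
    ).fetchone()
    return Person(*row)


def headcount_by_department(conn: sqlite3.Connection) -> list:
    # Reporting path: let the database aggregate and return plain tuples,
    # without building any Person objects.
    return conn.execute(
        "SELECT department, COUNT(*) FROM person "
        "GROUP BY department ORDER BY department"
    ).fetchall()
```

The report never touches the Person class, which is roughly what scalar or array hydration buys you on large result sets.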
@Quicker makes a valid point by saying you can create specialized View objects. You can fetch only the data you need in specific cases and manually mold that data into objects. This goes together with his point about not bloating the user interface with options that a user with a certain role doesn't need.
A technique you might want to look into is Command Query Responsibility Segregation (CQRS).

As I understand it, you have a fully normalized persons table and are now asking how best to denormalize it.
As long as you do not hit any technical constraints (such as a 64 KB maximum), I find 70 columns definitely not excessive for a persons table in an HR system. Do yourself a favour and do not segment that information, for the following reasons:
selects potentially become more complex
each extracted table needs one or more extra indexes, which increases your overall memory utilization. This sounds like a minor issue, as disk is cheap. However, keep in mind that, via caching, the ratio of RAM to disk space utilization determines your performance to a huge extent
changes become more complex, as extra relations demand extra care
as any edit/update/read view can be restricted to deal only with slices of the physical data from the tables, no "cosmetic" pressure arises from the end user (or even admin) perspective
In summary, splitting the table into subsets causes a lot of issues and effort but adds little if any value.
Efficient persistence strategy for many-to-many relationship

TL;DR: should I use an SQL JOIN table or Redis sets to store large amounts of many-to-many relationships
I have an in-memory object graph structure where I have a "many-to-many" index represented as a bidirectional mapping between ordered sets:
group_by_user    | user_by_group
louis: [1,2]     | 1: [louis]
john: [2,3]      | 2: [john, louis]
                 | 3: [john]
The basic operations that I need to be able to perform are atomic "insert at" and "delete" operations on the individual sets. I also need to be able to do efficient key lookup (e.g. lookup all groups a user is a member of, or lookup all the users who are members of one group). I am looking at a 70/30 read/write use case.
My question is: what is my best bet for persisting this kind of data structure? Should I be looking at building my own optimized on-disk storage system? Otherwise, is there a particular database that would excel at storing this kind of structure?
Before you read any further: stop being afraid of JOINs. This is a classic case for using a genuine relational database such as Postgres.
There are a few reasons for this:
This is what a real RDBMS is optimized for
The database can take care of your integrity constraints as a matter of course
You will have to push "join" logic into your own code
You will have to deal with integrity concerns in your own code
You will wind up reinventing database features in your own code
This is what a real RDBMS is optimized for
Yes, I am being a little silly, but because I'm trying to drive home a point.
I am beating on that drum so hard because this is a classic case that has a readily available, extremely optimized and profoundly stable tool custom designed for it.
When I say that you will wind up reinventing database features I mean that you will start having to make basic data management decisions in your own code. For example, you will have to choose when to actually write the data to disk, when to pull it, how to keep track of the highest-frequency use data and cache it in memory (and how to manage that cache), etc. Baking performance assumptions into your code early can give your whole codebase cancer without you noticing it -- and if those assumptions later prove false, changing them can require a major rewrite.
If you store the data on either end of the many-to-many relationship in one store and the many-to-many map in another store you will have to:
Locate the initial data on one side of the mapping
Extract the key(s)
Query for the key(s) in the many-to-many handler
Receive the response set(s)
Query whatever is relevant from your other storage based on the result
Build your answer for use within the system
If you structure your data within an RDBMS to begin with your code will look more like:
Run a pre-built query indexed over whatever your search criteria is
Build an answer from the response
JOINs are a lot less scary than doing it all yourself -- especially in a concurrent system where other things may be changing in the course of your ad hoc locate-extract-query-receive-query-build procedure (which can be managed, of course, but why manage it when an RDBMS is already designed to manage it?).
JOIN isn't even a slow operation in decent databases. I have some business applications that join 20 tables constantly over fairly large tables (several millions of rows) and it zips right through them. It is highly optimized for this sort of thing which is why I use it. Oracle does well at this (but I can't afford it), DB2 is awesome (can't afford that, either), and SQL Server has come a long way (can't afford the good version of that one either!). MySQL, on the other hand, was really designed with the key-value store use-case in mind and matured in the "performance above all else" world of web applications -- and so it has some problems with integrity constraints and JOINs (but has handled replication very well for a very long time). So not all RDBMSes are created equal, but without knowing anything else about your problem they are the kind of datastore that will serve you best.
Even slightly non-trivial data can make your code explode in complexity -- hence the popularity of database systems. They aren't (supposed to be) religions, they are tools to let you separate a generic data-handling task from your own program's logic so you don't have to reinvent the wheel every project (but we tend to anyway).
Q: When would you not want to do this?
A: When you are really building a graph and not a set of many-to-many relations.
There is another type of database designed specifically to handle that case. You need to keep in mind, though, what your actual requirements are. Is this data ephemeral? Does it have to be correct? Do you care if you lose it? Does it need to be replicated? etc. Most of the time requirements are relatively trivial and the answer is "no" to these sorts of higher-flying questions -- but if you have some special operational needs then you may need to take them into account when making your architectural decision.
If you are storing things that are actually documents (instead of structured records) on the one hand, and need to track a graph of relationships among them on the other then a combination of back-ends may be a good idea. A document database + a graphing database glued together by some custom code could be the right thing.
Think carefully about which kind of situation you are actually facing instead of assuming you have case X because it is what you are already familiar with.
In relational databases (e.g. SQL Server, MySQL, Oracle...), the typical way of representing such data structures is with a "link table". For example:
users table:
userId (primary key)
groups table:
groupId (primary key)
userGroups table: (this is the link table)
userId (foreign key to users table)
groupId (foreign key to groups table)
compound primary key of (userId, groupId)
Thus, to find all groups with users named "fred", you might write the following query (assuming the users table also has a name column):
SELECT g.*
FROM users u
JOIN userGroups ug ON ug.userId = u.userId
JOIN groups g ON g.groupId = ug.groupId
WHERE u.name = 'fred'
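The link-table layout and both lookup directions can be tried out with a minimal sketch like the one below. It uses Python's sqlite3 module purely as a convenient stand-in for whichever RDBMS you pick. The name column, the extra index and the helper functions are assumptions added for the example; the sample data is the louis/john mapping from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE users (userId INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- "groups" is quoted because some databases treat it as a keyword.
    CREATE TABLE "groups" (groupId INTEGER PRIMARY KEY);
    CREATE TABLE userGroups (
        userId  INTEGER NOT NULL REFERENCES users(userId),
        groupId INTEGER NOT NULL REFERENCES "groups"(groupId),
        PRIMARY KEY (userId, groupId)
    );
    -- The primary key serves lookups by user; this index serves lookups by group.
    CREATE INDEX userGroups_by_group ON userGroups (groupId, userId);
""")
conn.executemany(
    "INSERT INTO users (userId, name) VALUES (?, ?)", [(1, "louis"), (2, "john")]
)
conn.executemany('INSERT INTO "groups" (groupId) VALUES (?)', [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO userGroups (userId, groupId) VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 2), (2, 3)],
)
conn.commit()


def groups_of(user_name):
    """All group ids the named user belongs to."""
    rows = conn.execute(
        """
        SELECT g.groupId
        FROM users u
        JOIN userGroups ug ON ug.userId = u.userId
        JOIN "groups" g ON g.groupId = ug.groupId
        WHERE u.name = ?
        ORDER BY g.groupId
        """,
        (user_name,),
    ).fetchall()
    return [group_id for (group_id,) in rows]


def users_in(group_id):
    """All user names that are members of the given group."""
    rows = conn.execute(
        """
        SELECT u.name
        FROM userGroups ug
        JOIN users u ON u.userId = ug.userId
        WHERE ug.groupId = ?
        ORDER BY u.name
        """,
        (group_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

The compound primary key rejects duplicate memberships and the foreign keys reject memberships for users or groups that do not exist, so neither check has to live in application code.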
Database storage design of large amounts of heterogeneous data

Here is something I've wondered for quite some time, and have not seen a real (good) solution for yet. It's a problem I imagine many games having, and that I can't easily think of how to solve (well). Ideas are welcome, but since this is not a concrete problem, don't bother asking for more details - just make them up! (and explain what you made up).
Ok, so, many games have the concept of (inventory) items, and often, there are hundreds of different kinds of items, all with often very varying data structures - some items are very simple ("a rock"), others can have insane complexity or data behind them ("a book", "a programmed computer chip", "a container with more items"), etc.
Now, programming something like that is easy - just have everything implement an interface, or maybe extend an abstract root item. Since objects in the programming world don't have to look the same on the inside as on the outside, there is really no issue with how much and what kind of private fields any type of item has.
But when it comes to database serialization (binary serialization is of course no problem), you are facing a dilemma: how would you represent that in, say, a typical SQL database?
Some attempts at a solution that I have seen, none of which I find satisfying:
1. Binary serialization of the items; the database just holds an ID and a blob.
   Pros: takes like 10 seconds to implement.
   Cons: basically sacrifices every database feature, hard to maintain, near impossible to refactor.
2. A table per item type.
   Pros: clean, flexible.
   Cons: with a wide variety come hundreds of tables, and every search for an item has to query them all, since SQL doesn't have the concept of a table/type 'reference'.
3. One table with a lot of fields that aren't used by every item.
   Pros: takes like 10 seconds to implement, still searchable.
   Cons: waste of space, poor performance, hard to tell from the database which fields are in use.
4. A few tables with a few 'base profiles' for storage, where similar items get thrown together and use the same fields for different data.
   Pros: I've got nothing.
   Cons: waste of space, poor performance, hard to tell from the database which fields are in use.
What ideas do you have? Have you seen another design that works better or worse?
It depends on whether you need to sort, filter, count, or analyze those attributes.
If you use EAV, then you will screw yourself nicely. Try doing reports on an EAV schema.
The best option is to use Table Inheritance:
PRODUCT (base table)
  id pk
(first subtype table)
  id pk fk PRODUCT
(second subtype table)
  id pk fk PRODUCT
  att 5
For attributes that you don't need to search, sort or analyze, use a blob or XML.
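As a rough illustration of table inheritance, the sketch below uses Python's sqlite3 module with one base table and two subtype tables that share its primary key. The subtype names (product_book, product_container) and their columns are hypothetical, picked from the item examples in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Base table: one row per item, holding the attributes every item has.
    CREATE TABLE product (
        id   INTEGER PRIMARY KEY,
        type TEXT NOT NULL,
        name TEXT NOT NULL
    );
    -- Subtype tables: the primary key is also a foreign key to the base row.
    CREATE TABLE product_book (
        id    INTEGER PRIMARY KEY REFERENCES product(id),
        pages INTEGER NOT NULL
    );
    CREATE TABLE product_container (
        id       INTEGER PRIMARY KEY REFERENCES product(id),
        capacity INTEGER NOT NULL
    );
""")


def add_product(kind, name):
    """Insert a base row; simple items such as a rock need nothing more."""
    cur = conn.execute(
        "INSERT INTO product (type, name) VALUES (?, ?)", (kind, name)
    )
    return cur.lastrowid


def add_book(name, pages):
    """Insert the base row and the subtype row in one transaction."""
    with conn:
        product_id = add_product("book", name)
        conn.execute(
            "INSERT INTO product_book (id, pages) VALUES (?, ?)",
            (product_id, pages),
        )
    return product_id


def find_by_name(name):
    """Searching across all item types touches only the base table."""
    return conn.execute(
        "SELECT id, type FROM product WHERE name = ? ORDER BY id", (name,)
    ).fetchall()


def books_longer_than(pages):
    """Type-specific filters join the base table to one subtype table."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM product p
        JOIN product_book b ON b.id = p.id
        WHERE b.pages > ?
        ORDER BY p.name
        """,
        (pages,),
    ).fetchall()
    return [name for (name,) in rows]
```

A search over every item never has to visit the subtype tables, which addresses the "query them all" objection to one table per item type.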
I have two alternatives for you:
One table for the base type and supplemental tables for each “class” of specialized types.
In this schema, properties common to all “objects” are stored in one table, so you have a unique record for every object in the game. For special types like books, containers, usable items, etc, you have another table for each unique set of properties or relationships those items need. Every special type will therefore be represented by two records: the base object record and the supplemental record in a particular special type table.
PROS: You can use column-based features of your database like custom domains, checks, and xml processing; you can have simpler triggers on certain types; your queries differ exactly at the point of diverging concerns.
CONS: You need two inserts for many objects.
Use a “kind” enum field and a JSONB-like field for the special type data.
This is kind of like your #1 or #3, except with some database help. Postgres added JSONB, giving you an improvement over the old EAV pattern. Other databases have a similar complex field type. In this strategy you roll your own mini schema that you stash in the JSONB field. The kind field declares what you expect to find in that JSONB field.
PROS: You can extract special type data in your queries; can add check constraints and have a simple schema to deal with; you can benefit from indexing even though your data is heterogeneous; your queries and inserts are simple.
CONS: Your data types within JSONB-like fields are pretty limited and you have to roll your own validation.
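The "roll your own validation" part might look roughly like this. The sketch stores the JSON as plain text in SQLite, a stand-in for a real JSONB column, so it does not show querying inside the field. The SCHEMAS table and the function names are hypothetical.

```python
import json
import sqlite3

# Hypothetical mini schema: the keys each kind must carry, and their types.
SCHEMAS = {
    "rock": {},
    "book": {"title": str, "pages": int},
    "container": {"capacity": int},
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, data TEXT NOT NULL)"
)


def validate(kind, data):
    """Reject unknown kinds, missing or extra keys, and wrongly typed values."""
    schema = SCHEMAS.get(kind)
    if schema is None:
        raise ValueError(f"unknown kind: {kind}")
    if set(data) != set(schema):
        raise ValueError(f"{kind} expects keys {sorted(schema)}, got {sorted(data)}")
    for key, expected_type in schema.items():
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{kind}.{key} must be {expected_type.__name__}")


def save_item(kind, data):
    """Validate against the kind's mini schema, then store the JSON text."""
    validate(kind, data)
    cur = conn.execute(
        "INSERT INTO item (kind, data) VALUES (?, ?)",
        (kind, json.dumps(data, sort_keys=True)),
    )
    return cur.lastrowid


def load_item(item_id):
    """Return (kind, data) so the caller knows what the JSON should contain."""
    kind, raw = conn.execute(
        "SELECT kind, data FROM item WHERE id = ?", (item_id,)
    ).fetchone()
    return kind, json.loads(raw)
```

The kind column tells the reader of a row which mini schema applies, so the JSON field never has to be guessed at.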
Yes, it is a pain to design database formats like this. I'm designing a notification system and ran into the same problem. My notification system is, however, less complex than yours: the data it holds is at most IDs and usernames. My current solution is a mix of 1 and 3: I serialize the data that differs between notifications, and use a column for the two usernames (some notifications have two, some only one). I shy away from method 2 because I hate that design, but that's probably just me.
However, if you can afford it, I would suggest thinking outside the realm of RDBMSs. It sounds like a non-relational store (especially a key/value one) may be a better fit for this data, especially if the items differ a lot from one another.
I'm sure this has been asked here a million times before, but in addition to the options which you have discussed in your question, you can look at EAV schema which is very flexible, but which has its own sets of cons.
Another alternative is database systems which are not relational. There are object databases as well as various key/value stores and document databases.
Typically all these things break down to some extent when you need to query against the flexible attributes. This is kind of an intrinsic problem, however. Conceptually, what does it really mean to query things accurately which are unstructured?
First of all, do you actually need the concurrency, scalability and ACID transactions of a real database? Unless you are building an MMO, your game structures will likely fit in memory anyway, so you can search and otherwise manipulate them there directly. In a scenario like this, the "database" is just a store for serialized objects, and you can replace it with the file system.
If you conclude that you do (need a database), then the key is in figuring out what "atomicity" means from the perspective of the data management.
For example, if a game item has a bunch of attributes, but none of these attributes are manipulated individually at the database level (even though they could well be at the application level), then it can be considered as "atomic" from the data management perspective. OTOH, if the item needs to be searched on some of these attributes, then you'll need a good way to index them in the database, which typically means they'll have to be separate fields.
Once you have identified attributes that should be "visible" versus the attributes that should be "invisible" from the database perspective, serialize the latter to BLOBs (or whatever), then forget about them and concentrate on structuring the former.
That's where the fun starts and you'll probably need to use "all of the above" strategy for reasonable results.
BTW, some databases support "deep" indexes that can go into heterogeneous data structures. For example, take a look at Oracle's XMLIndex, though I doubt you'll use Oracle for a game.
You seem to be trying to solve this for a gaming context, so maybe you could consider a component-based approach.
I have to say that I personally haven't tried this yet, but I've been looking into it for a while and it seems to me something similar could be applied.
The idea would be that all the entities in your game would basically be a bag of components. These components can be Position, Energy or for your inventory case, Collectable, for example. Then, for this Collectable component you can add custom fields such as category, numItems, etc.
When you're going to render the inventory, you can simply query your entity system for items that have the Collectable component.
How can you save this into a DB? You can define the components independently in their own table and then for the entities (each in their own table as well) you would add a "Components" column which would hold an array of IDs referencing these components. These IDs would effectively be like foreign keys, though I'm aware that this is not exactly how you can model things in relational databases, but you get the idea.
Then, when you load the entities and their components at runtime, based on the component being loaded you can set the corresponding flag in their bag of components so that you know which components this entity has, and they'll then become queryable.
What is wrong with this database design?

Someone pointed out to me that the following database design has serious issues. Can anyone tell me why?
a tb_user table saves all the users' information
tb_user table will have 3 - 8 users only.
each user's data will be saved in a separate table, naming after the user's name.
Say a user is called bill_admin; then he has a separate table, i.e. bill_admin_data, to save all the data that belongs to him. All users' data share the same structure.
The person who pointed out this problem said I should merge all the data into one table and use an FK to distinguish the users, but I have the following counterarguments:
users will only be 3 - 8, so there's not gonna be a lot of tables anyway.
each user has a very large data table, say 500K records.
Is it bad practice to design a database like this? And why? Thank you.
Because it isn't very maintainable.
1) Adding data to a database should never require modifying the structure. In your model, if you ever need to add another person you will need a new table (or two). You may not think you will ever need to do this, but trust me. You will.
So assume, for example, you want to add functionality to your application to add a new user to the database. With this structure you will have to give your end users rights to create new tables, which creates security problems.
2) It violates the DRY principle. That is, you are creating multiple copies of the same table structure that are identical. This makes maintenance a pain in the butt.
3) Querying across multiple users will be unnecessarily complicated. There is no good reason to split each user into a separate table other than having a vendetta against the person who has to write queries against this DB model.
4) If you are splitting it into multiple tables for performance because each user has a lot of rows, you are reinventing the wheel. The RDBMS you are using undoubtedly has an indexing feature which allows it to efficiently query large tables. Your home-grown hack is not going to outperform the platform's approach for handling large data.
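For comparison, the merged design could look something like the sketch below, again using Python's sqlite3 module as a stand-in for your RDBMS. The tb_user_data table, its payload column and the helper names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE tb_user (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- One shared data table; user_id says whose row it is.
    CREATE TABLE tb_user_data (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES tb_user(id),
        payload TEXT NOT NULL
    );
    -- The index keeps per-user lookups fast however many rows there are.
    CREATE INDEX tb_user_data_by_user ON tb_user_data (user_id);
""")


def add_user(name):
    """Adding a user is a plain INSERT; no new table and no DDL rights needed."""
    return conn.execute("INSERT INTO tb_user (name) VALUES (?)", (name,)).lastrowid


def add_data(user_id, payload):
    conn.execute(
        "INSERT INTO tb_user_data (user_id, payload) VALUES (?, ?)",
        (user_id, payload),
    )


def rows_for(name):
    """One user's data: the same query works for every user."""
    rows = conn.execute(
        """
        SELECT d.payload
        FROM tb_user u
        JOIN tb_user_data d ON d.user_id = u.id
        WHERE u.name = ?
        ORDER BY d.id
        """,
        (name,),
    ).fetchall()
    return [payload for (payload,) in rows]


def row_counts():
    """A cross-user report is a single GROUP BY rather than a query per table."""
    return conn.execute(
        """
        SELECT u.name, COUNT(d.id)
        FROM tb_user u
        LEFT JOIN tb_user_data d ON d.user_id = u.id
        GROUP BY u.id, u.name
        ORDER BY u.name
        """
    ).fetchall()
```

Points 1 to 4 above each map to one piece of this sketch: add_user needs no schema change, the structure exists once, row_counts spans all users, and the index does the work the per-user tables were meant to do.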
I wouldn't say it's bad design per se. It is just not the type of design for which relational databases are designed and optimized.
Of course, you can store your data as you mention, but many operations won't be trivial. For example:
Adding a new person
Removing a person
Generating reports based on data across all your people
If you don't really care about doing any of this, go ahead and create your tables as you propose, although I would recommend using a non-relational database such as MongoDB, which is more suited to this type of structure.
If you prefer using a relational database, aggregating data by type rather than by person gives you a lot of flexibility when adding new people and calculating reports.
500k rows is not "very large", so don't worry about size when making your design.
Are there ORM (OKM) for key-value stores?

Object-Relational-Mappers have been created to help applications (which think in terms of objects) deal with stored data in a more application-friendly way like every other class/object.
However, I have never seen an OKM (Object-Key/Value-Mapper) for NoSQL "Key/Value" storage systems. That seems odd, because the need should be far greater given that more value relations will have to be hard-coded into the app than for a regular, single SQL table row object.
four requests:
vs one request:
user = [id => ..., name => ..., email => ...]
Plus you must keep track of "lists" (post has_many comments) since you don't have has_many through tables or foreign keys.
INSERT INTO user_groups (user_id, group_id) VALUES (23, 54)
usergroups:user_id = {54,108,32,..}
groupsuser:group_id = {23,12,645,..}
And there are lots more examples of the added logic that an application would need in order to replicate some basic features that normal relational databases provide. All of these reasons make the idea of an OKM sound like a shoo-in.
Are there any? Are there any reasons there are not any?
Ruby's DataMapper project is an ORM and will happily talk to a key-value store through the use of an adapter.
Redis and MongoDB have adapters that already exist. CouchDB has an adapter — it's not maintained, but at one point it worked pretty well. I don't think anyone's done anything with Cassandra yet, but there's no reason it couldn't be done. The Dubious framework for Google App Engine takes a very similar approach to Data Mapper to make the Data Store available to applications.
So it's very possible to do ORM with key-value stores. The ORM just really needs to avoid the assumption that SQL is its primary vocabulary.
One of the design goals of SQL is that any data can be stored/queried in any relational database - There are some differences between platforms, but in general the correct way to handle a particular data structure is well known and easily automated but requiring fairly verbose code. That is not the case with NoSQL - generally you will be directly storing the data as used in your application rather than trying to map it to a relational structure, and without joins or other object/relational differences the mapping code is trivial.
Beyond generating the boilerplate data access code, one of the main purposes of an ORM is abstraction of differences between platforms. In my experience the ability to switch platforms has always been purely theoretical, and this lowest common denominator approach simply won't work for NoSQL as the platform is usually chosen specifically for capabilities not present on other platforms. Your example is only for the most trivial key value store - depending on your platform you most likely have some useful additional commands, so your first example could be
MGET user:id:name user:id:email ... (multiget - get any number of keys in a single call)
GET user:id:* (key wildcards)
HGETALL user:id (redis hash - gets all subkeys of user)
You might also have your user object stored in a serialized form - unlike in a relational database this will not break all your queries.
Working with lists isn't great if your platform doesn't have support built in - native list/set support is one of the reasons I like to use redis - but aside from potentially needing locks it's no worse than getting the list out of sql.
It's also worth noting that you may not need all the relationships you would define in SQL. For example, if you have a group containing a million users, the ability to get a list of all users in a group is completely useless, so you would never create the groupsuser list at all, and rather than a separate usergroups list you would have user:id:groups as a multivalue property. If you just need to check for membership you could set up keys as usergroups:userid:groupid and get constant-time lookup.
I find it helps to think in terms of indexes rather than relationships: when setting up your data access code, decide which fields will need to be queried and add appropriate index records when those fields are written.
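A minimal sketch of that "index records on write" idea follows. The KeyValueStore class is a toy in-memory stand-in for a store with hashes and sets, not a real Redis client; its method names, the index:user:email: key prefix and the helper functions are all invented for the example.

```python
class KeyValueStore:
    """Toy in-memory stand-in for a key-value store with hashes and sets."""

    def __init__(self):
        self._hashes = {}
        self._sets = {}

    def hset_all(self, key, mapping):
        self._hashes[key] = dict(mapping)

    def hget_all(self, key):
        return dict(self._hashes.get(key, {}))

    def sadd(self, key, member):
        self._sets.setdefault(key, set()).add(member)

    def srem(self, key, member):
        self._sets.get(key, set()).discard(member)

    def smembers(self, key):
        return set(self._sets.get(key, set()))


def save_user(store, user_id, name, email):
    """Write the record and keep the email index in step with it."""
    old = store.hget_all(f"user:{user_id}")
    if old and old["email"] != email:
        # The indexed field changed, so drop the stale index record first.
        store.srem(f"index:user:email:{old['email']}", user_id)
    store.hset_all(f"user:{user_id}", {"name": name, "email": email})
    store.sadd(f"index:user:email:{email}", user_id)


def find_users_by_email(store, email):
    """Query through the index record instead of scanning every user key."""
    user_ids = sorted(store.smembers(f"index:user:email:{email}"))
    return [store.hget_all(f"user:{user_id}") for user_id in user_ids]
```

Only fields that are actually queried get an index record, which is the decision the paragraph above asks you to make up front. On a real store the two writes in save_user would also need to be made atomic.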
ORMs don't map terribly well to the schema-less nature of key-value stores. That being said, if you're using Riak and Ruby, you could take a look at Ripple. There are a number of other drivers for Riak which might fit with your language.
If you're looking into MongoDB (more of a document store than a k/v store), there are a number of drivers available.
The UniVerse DB, which is a descendant of Pick, lets you store a list of key-value pairs for a given key. However, this is very old technology and the world ran away from these databases a long time ago.
You can implement this in an SQL database with a three column table
Although most DBAs will hit you over the head with the very thick Codd and Date hardback edition if you propose this, it is in fact a very common pattern in packaged applications to allow you to add site specific attributes to a system.
To paraphrase Richard Stallman's comments on LISP.
SQL-Server DB design time scenario (distributed or centralized)

We have a design-time scenario for an SQL Server DB: we have to store data about different organizations in our database (e.g. Customer, Vendor, Distributor, ...). All the different organizations share (almost) the same type of information, like address details. They will also be referred to in other tables (i.e. linked via OrgId, and we have to look up OrgName in many different places).
I see two options:
We create a table for each organization like OrgCustomer, OrgDistributor, OrgVendor, etc... all the tables will have similar structure and some tables will have extra special fields like the customer has a field HomeAddress (which the other Org tables don't have) .. and vice-versa.
We create a common OrgMaster table and store ALL the diff Orgs at a single place. The table will have a OrgType field to distinguish among the diff types of Orgs. And the special fields will be appended to the OrgMaster table (only relevant Org records will have values in such fields, in other cases it'll be NULL)
Some Pros & Cons of #1:
It helps distribute the load while accessing diff type of Org data so I believe this improves performance.
Provides full scope for customizing any particular Org table without affecting the other existing Org types.
Not sure if different indexes on different/distributed tables work better than a single big table.
Replication of design. If I have to increase the size of the ZipCode field - I've to do it in ALL the tables.
Replication in manipulation implementation (i.e. we've used stored procedures for CRUD operations so the replication goes n-fold .. 3-4 INSERT SPs, 2-3 SELECT SPs, etc...)
Everything grows n-fold right from DB constraints\indexing to SP to the Business objects in the application code.
Change(common) in one place has to be made at all the other places as well.
Some Pros & Cons of #2:
N-fold becomes 1-fold :-)
Maintenance gets easy because we can try and implement single entry points for all the operations (i.e. a single SP to handle CRUD operations, etc..)
We've to worry about maintaining a single table. Indexing and other optimizations are limited to a single table.
Does it create a bottleneck? Can it be managed by implementing Views and other optimized data access strategy?
The other side of centralized implementation is that a single change has to be tested and verified at ALL the places. It isn't abstract.
The design might seem a little less 'organized\structured' esp. due to those few Orgs for which we need to add 'special' fields (which are irrelevant to the other tables)
I also have in mind an option #3: keep the Org tables separate but create a common OrgAddress table to store the common fields. But this puts me somewhere between #1 and #2 and creates even more confusion!
To be honest, I'm an experienced programmer but not an equally experienced DBA because that's not my main-stream job so please help me derive the correct tradeoff between parameters like the design-complexity and performance.
Thanks in advance. Feel free to ask for any technical details; suggestions are welcome.
I would say that your 2nd option is close, just few points:
Customer, Distributor, Vendor are TYPES of organizations, so I would suggest:
Table [Organization] which has all columns common to all organizations and a primary key for the row.
Separate tables [Vendor], [Customer], [Distributor] with specific columns for each one and FK to the [Organization] row PK.
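That layout can be sketched as below, with Python's sqlite3 module standing in for SQL Server. The ZipCode and HomeAddress columns come from the question; the Invoice table and the helper functions are hypothetical, added only to show another table referring to an organization through OrgId.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Supertype: columns common to every organization, defined once.
    CREATE TABLE Organization (
        OrgId   INTEGER PRIMARY KEY,
        OrgType TEXT NOT NULL
                CHECK (OrgType IN ('Customer', 'Vendor', 'Distributor')),
        OrgName TEXT NOT NULL,
        ZipCode TEXT
    );
    -- Subtype: only the columns specific to customers.
    CREATE TABLE Customer (
        OrgId       INTEGER PRIMARY KEY REFERENCES Organization(OrgId),
        HomeAddress TEXT
    );
    -- Hypothetical referencing table: it needs only OrgId, whatever the type.
    CREATE TABLE Invoice (
        InvoiceId INTEGER PRIMARY KEY,
        OrgId     INTEGER NOT NULL REFERENCES Organization(OrgId)
    );
""")
conn.executemany(
    "INSERT INTO Organization (OrgId, OrgType, OrgName, ZipCode) VALUES (?, ?, ?, ?)",
    [(1, "Customer", "Acme", "12345"), (2, "Vendor", "Bolt Ltd", "54321")],
)
conn.execute("INSERT INTO Customer (OrgId, HomeAddress) VALUES (1, '1 High St')")
conn.executemany(
    "INSERT INTO Invoice (InvoiceId, OrgId) VALUES (?, ?)", [(10, 1), (11, 2)]
)
conn.commit()


def invoice_org_names():
    """OrgName is looked up in one place, regardless of organization type."""
    return conn.execute(
        """
        SELECT i.InvoiceId, o.OrgName
        FROM Invoice i
        JOIN Organization o ON o.OrgId = i.OrgId
        ORDER BY i.InvoiceId
        """
    ).fetchall()


def customers():
    """Customer-specific columns come from a join to the subtype table."""
    return conn.execute(
        """
        SELECT o.OrgName, c.HomeAddress
        FROM Customer c
        JOIN Organization o ON o.OrgId = c.OrgId
        ORDER BY o.OrgName
        """
    ).fetchall()
```

Widening ZipCode is then a change to one table, and the special fields stay out of the rows that do not need them.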
This sounds like a "supertype/subtype relationship".
I have worked on various applications that have implemented all of your options. To be honest, you probably need to take account of the way that your users work with the data, how many records you are expecting, commonality (same organisation having multiple functions), and what level of updating of the records you are expecting.
