How do I structure multiple identity records for one person in a database?

I am designing a database for a credit bureau and am seeking some guidance.
The data they receive from banks, MFIs, SACCOs, utility companies, etc. comes with various types of IDs. For example, it is perfectly legal to open a bank account with a National ID and also with a Passport. The scenario that has me banging my head is this: Customer1 takes a credit facility (call it a loan for now) at Bank1 with their passport, then goes to Bank2 and takes another loan with their National ID, and to Bank3 with their Military ID. When this data eventually comes from the banks to the bureau, it would be seen as 3 different people, while we know it is actually 1 person. At this point, there is nothing we can do as a bureau.
However, one way out (for now) is using the government registry, which provides a repository holding both passports and National IDs. So once we query for this information and get a response, how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Again, a person's name could be captured in various orders. Bank1 could capture FName, LName, OName while Bank3 captures only LName, FName. How do I store these names?
Even against one ID type, e.g. the National ID, you will often find misspelled or missing names. So one National ID in our database could end up with about 6 different names, because the person's name was captured differently by the various banks where they have transacted.
And that is just the tip of the iceberg. We have issues with addresses, telephone numbers, etc etc.
Do you have any insight as to how I'd structure my database to ensure we capture all data from all banks and provide the most accurate information possible about an individual? Better yet, do you have experience with this type of setup?
Thanks.

how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Trivial.
You have an identity table with an AlternateId field that is set when the identity is linked to another one. Use the first identity you created as the master; any alternative will have AlternateId pointing to it.
You need to separate the identity from the data in it, so you can have alternate versions of that data, possibly each with an origin and a timestamp. You will most likely need to fully support versioning and tying different identities to each other as alternatives, including generating a "master identity", possibly by algorithm, that holds the "official" (i.e. consolidated) version of your data.
The details are complex - mostly you have to make a LOT of compromises without killing performance, so at the end: HIRE A SPECIALIST. There is a reason there are people working as senior database designers or architects with 20+ years of experience finding the optimal solution given constraints you may not even be aware of (application-wise).
Better yet, do you have experience with this type of setup?
Yes. Try financial information. Stock symbols / feeds / definitions are not necessarily compatible and vary by whom you get them from. Any non-trivial setup has different data feeds that may show the same item slightly differently, sometimes in error. Different names, sometimes different prices (example: ES, from CME Group, is 50 USD per point, but on TT FIX it is 5 - to make up for it, the price is multiplied by 10, so instead of 1000.25 you get 10002.5). This is the same line of consolidation work, and it STINKS.
Tons of code, tons of proper database design, redoing it half a dozen times to get the proper performance. This is tricky, sadly.

Customer Deduplication in Booking Application

We have a booking system where tens of thousands of reservations are made every day. Because a customer can create a reservation without being logged in, a new customer id/row is created for every reservation, even if the very same customer has already reserved in the system before. That results in a lot of customer duplicates.
The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script, every day, which checks for these duplicates based on some business rules (email, address, etc.). The deduplication logic is then:
If a new reservation is created, check whether the (newly created) customer for this reservation already has an old customer id (by comparing email and other aspects).
If it has one or more old reservations, detach that reservation from the old customer id and link it to the new customer id - literally by changing the customer ID of that old reservation to the newly created customer.
I don't have a very strong technical background, but to me this smells like terrible design. As we have several operational applications relying on that data, this creates a massive sync issue. Besides that, I was hoping to understand why exactly, in terms of application architecture, this is bad design, and what a better solution would be for this deduplication problem (if it even has to be solved in "this" application domain).
I would very much appreciate any help so I can steer the engineering team in the right direction.
In General
What's the problem you're trying to solve? Freeing up disk space, getting accurate analytics of user behavior, or being more user friendly?
It feels a bit risky, and depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse" - not because you should be paranoid, but because to not think that through feels a bit negligent. E.g. if you were a govt department matching private citizen records then that approach would be way too cavalier.
If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.
If there's not a process for validating the identity of the user then by definition your customer id/row is storing sessions, not Customers.
In terms of the nightly job: if your backend system is an old legacy system, I can appreciate why a nightly batch job might be the easiest option; that said, done correctly and with the right architecture, you should be able to do that check on the fly as needed.
Specifics
...check if the (newly created) customer for this reservation has already an old customer id (by comparing email...
Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.
... and other aspects.
What are those? Sometimes getting more data just makes it harder unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone makes a typo in the phone number which matches some other customer - so you simultaneously match with more than one customer?
If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.
Feels dangerous. What happens if the detaching process screws up? I've seen situations where, instead of updating the delta, the system did a total purge and then a full re-import... when the second part fails, the entire system is blank. It's not your exact situation, but you are creating the possibility for similar types of issues.
As we have several operational applications relying on that data, this creates a massive sync issue.
...case in point.
In your case, doing the swap in a transaction would be wise. You may want to consider tracking all Cust ID swaps so that you can revert if something goes wrong.
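A hedged sketch of what that could look like in SQL - the log table, the reservations table name, and the ids 17 and 42 are all placeholders for illustration:

-- Audit table so any re-link can be reverted later.
CREATE TABLE customer_id_swap_log (
    swap_id         BIGSERIAL PRIMARY KEY,
    reservation_id  BIGINT NOT NULL,
    old_customer_id BIGINT NOT NULL,
    new_customer_id BIGINT NOT NULL,
    swapped_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

BEGIN;
    -- Record what is about to change (17 = duplicate being merged away, 42 = surviving customer).
    INSERT INTO customer_id_swap_log (reservation_id, old_customer_id, new_customer_id)
    SELECT reservation_id, customer_id, 42
    FROM   reservations
    WHERE  customer_id = 17;

    -- Do the swap itself inside the same transaction, so either both steps happen or neither does.
    UPDATE reservations
    SET    customer_id = 42
    WHERE  customer_id = 17;
COMMIT;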
Option - Phased Introduction Based on Testing
You could try this:
Keep the system as-is for now.
Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records, just make a copy that is what the new data would be. Do this in production - you'll get a way better sample of data.
Run extensive tests over the trial data, looking for instances where you got it wrong. What's more likely, and what you could consider building, is a "scoring" algorithm. If you are checking more than one piece of data then you'll get different combinations with different likelihood of accuracy. You can use this to gauge how good your matching is. You can then decide in which circumstances it's safe to do the ID switch and when it's not.
Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.
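Building on the scoring idea above, the trial data could live in a shadow table that never touches the real records - names and columns here are purely illustrative:

-- Proposed matches only; nothing in the live customer or reservation tables changes.
CREATE TABLE customer_match_candidate (
    candidate_id    BIGSERIAL PRIMARY KEY,
    new_customer_id BIGINT NOT NULL,
    old_customer_id BIGINT NOT NULL,
    email_match     BOOLEAN NOT NULL,
    phone_match     BOOLEAN NOT NULL,
    address_match   BOOLEAN NOT NULL,
    score           NUMERIC NOT NULL,   -- e.g. a weighted sum of the match flags
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

Only candidates above a score threshold you have validated against real outcomes would ever be promoted to an actual ID switch.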
Alternative Customer/Session Approach
Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
Allow users to optionally be validated as "Customers" (big C).
Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.
I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.
I wouldn't say that's a terrible design; it's just a simple approach to solving this particular problem, with some room for improvement. It's not optimal because the runtime of that job depends on the number of new bookings received during the day, which may vary from day to day, so other workflows that depend on it will be impacted.
This approach can be improved by processing new bookings in parallel, and using an index to get a fast lookup when checking if a new e-mail already exists or not.
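For example, assuming a PostgreSQL backend and a customers table with an email column (both assumptions), the lookup index is a one-liner:

-- Case-insensitive lookup of existing customers by e-mail.
CREATE INDEX idx_customers_email ON customers (lower(email));

-- The duplicate check then becomes a cheap index scan.
SELECT customer_id FROM customers WHERE lower(email) = lower('jane@example.com');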
You can also check out Bloom filters - an efficient data structure that can tell you with certainty when an element is not in a given set.
The way I would do it is to store the bookings in a NoSQL DB table keyed off the user email. You get the user email in both situations - when the customer has an account and when they make a booking without an account - so you just have to do a lookup to get the bookings by email, which makes that deduplication job redundant.

Making a table with fixed columns versus key-valued pairs of metadata?

I was asked to create a table to store paid-hours data from multiple attendance systems from multiple geographies from multiple sub-companies. This table would be used for high level reporting so basically it is skipping the steps of creating tables for each system (which might exist) and moving directly to what the final product would be.
The request was to have a dimension for each type of hours or pay like this:
date       | employee_id | type          | hours | amount
2016-04-22 | abc123      | regular       | 80    | 3500
2016-04-22 | abc123      | overtime      | 6     | 200
2016-04-22 | abc123      | adjustment    | 1     | 13
2016-04-22 | abc123      | paid time off | 24    | 100
2016-04-22 | abc123      | commission    |       | 600
2016-04-22 | abc123      | gross total   |       | 4413
There are multiple rows per employee, but the thought process is that this will allow us to capture new dimensions if they are added.
The data is coming from several sources and I was told not to worry about the ETL, but just design the ultimate table and make it work for any system. We would provide this format to other people for them to fill in.
I have only seen the raw data from one system, and it looks like this:
date | employee_id | gross_total_amount | regular_hours | regular_amount | OT_hours | OT_amount | classification | amount | hours
It is pretty messy. There are multiple rows per employee, and values like gross_total repeat on each row. There is a classification column which has items like PTO (paid time off), adjustments, empty values, commission, etc. Because of the repeating values, it is impossible to simply sum the data up and make it equal the gross_total_amount.
Anyway, I would rather prefer a column-based approach where each row describes an employee's paid hours for a cut-off. One problem is that I won't know all of the possible types of hours, so I can't necessarily make a table like:
date | employee_id | gross_total_amount | commission_amount | regular_hours | regular_amount | overtime_hours | overtime_amount | paid_time_off_hours | paid_time_off_amount | holiday_hours | holiday_amount
I am more used to data formatted that way, though. The concern is that you might not capture all of the necessary columns, or something new gets added. (For example, I know there is maternity leave, paternity leave, and bereavement leave, and in other geographies there are labor laws about working at night, etc.)
Any advice? Is the table which was suggested to me from my superior a viable solution?
TAM makes lots of good points, and I have only two additional suggestions.
First, I would generate some fake data in the table as described above, and see if it can generate the required reports. Show your manager each of the reports based on the fake data, to check that they're OK. (It appears that the reports are the ultimate objective, so work back from there.)
Second, I would suggest that you get sample data from as many of the input systems as you can. This is to double check that what you're being asked to do is possible for all systems. It's not so you can design the ETL, or gather new requirements, just testing it all out on paper (do the ETL in your head). Use this to update the fake data, and generate fresh fake reports, and check the reports again.
Let me recapitulate what I understand to be the basic task.
You get data from different sources, having different structures. Your task is to consolidate them in a single database to be able to answer questions about all these data. I understand the hint "not to worry about the ETL, but just design the ultimate table" to mean that your consolidated database doesn't need to contain all the detail information that might be present in the original data, just enough information to fulfill the specific requirements of the consolidated database.
This sounds sensible as long as your superior is certain enough about these requirements. In that case, you will reduce the information coming from each source to the consolidated structure.
In any case, you'll have to capture the domain semantics of the data coming in from each source. Lacking access to your domain semantics, I can't clarify the mess of repeating values etc. for you. E.g., if there are detail records and gross total records, as in your example, it would be wrong to add the hours of all records, as this would always yield twice the hours actually worked. So someone will have to worry about ETL, namely interpreting each set of records (probably consisting of all entries for an employee and one working day), finding out what they mean, and transforming them to the consolidated structure.
I understand another part of the question to be about the usage of metadata. You can have different columns for notions like holiday leave and maternity leave, or you have a metadata table containing these notions as a key-value pair, and refer to the key from your main table. The metadata way is sometimes praised as being more flexible, as you can introduce a new type (like paternity leave) without redesigning your database. However, you will need to redesign the software filling and probably also querying your tables to make use of the new type. So you'll have to develop and deploy a new software release anyway, and adding a few columns to a table will just be part of that development effort.
There is one major difference between a broad table containing all notions as attributes and the metadata approach. If you want to make sure that, for a time period, either all or none of the values are present, that's easy with the broad table: just make all attributes `not null`, and you're done. Ensuring this for the metadata solution would mean some rather complicated constraint that may or may not be available depending on the database system you use.
If that's not a main requirement, I would go a pragmatic way and use different columns if I expect only a handful of those types, and a separate key-value table otherwise.
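To make the two options concrete, here is a rough sketch of both in SQL - column names and types are invented for illustration:

-- Option 1: broad table, one column per known pay type.
CREATE TABLE paid_hours_wide (
    pay_date            DATE NOT NULL,
    employee_id         TEXT NOT NULL,
    regular_hours       NUMERIC,
    regular_amount      NUMERIC,
    overtime_hours      NUMERIC,
    overtime_amount     NUMERIC,
    paid_time_off_hours NUMERIC,
    PRIMARY KEY (pay_date, employee_id)
);

-- Option 2: key-value rows against a small metadata table of pay types.
CREATE TABLE pay_type (
    pay_type_id SERIAL PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE       -- 'regular', 'overtime', 'paternity leave', ...
);

CREATE TABLE paid_hours_kv (
    pay_date    DATE NOT NULL,
    employee_id TEXT NOT NULL,
    pay_type_id INT  NOT NULL REFERENCES pay_type(pay_type_id),
    hours       NUMERIC,
    amount      NUMERIC,
    PRIMARY KEY (pay_date, employee_id, pay_type_id)
);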
All these considerations relied on your superior's assertion (as I understand it) that your consolidated table will only need to fulfill the requirements known today, so you are free to throw original detail information away if it's not needed due to these requirements. I'm wary of that kind of assertion. Let's assume some of your information sources deliver additional information. Then it's quite probable that someday someone asks for a report also containing this information, where present. This won't be possible if your data structure only contains what's needed today.
There are two ways to handle this, i.e. to provide for future needs. You can, after knowing the data coming from each additional source, extend your consolidated database to cover all data structures coming from there. This requires some effort, as different sources might express the same concept using different data, and you would have to consolidate those to make the data comparable. Also, there is some probability that not all of your effort will be worth the trouble, as not all of the detail information you get will actually be needed for your consolidated database. Another more elegant way would therefore be to keep the original data that you import for each source, and only in case of a concrete new requirement, extend your database and reimport the data from the sources to cover the additional details. Prices of storage being low as they are, this might yield an optimal cost-benefit ratio.

Designing a database with periodic sensor data

I'm designing a PostgreSQL database that takes in readings from many sensor sources. I've done a lot of research into the design and I'm looking for some fresh input to help get me out of a rut here.
To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).
The basic structure of the data coming in is as follows:
For each data logging device, there are several channels.
For each channel, the logger reads data and attaches it to a record with a timestamp
Different channels may have different data types, but generally a float4 will suffice.
Users should (through database functions) be able to add different value types, but this concern is secondary.
Loggers and channels will also be added through functions.
The distinguishing characteristic of this data layout is that I've got many channels associating data points to a single record with a timestamp and index number.
Now, to describe the data volume and common access patterns:
Data will be coming in for about 5 loggers, each with 48 channels, for every minute.
The total data volume in this case will be 345,600 readings per day, 126 million per year, and this data needs to be continually read for the next 10 years at least.
More loggers & channels will be added in the future, possibly from physically different types of devices but hopefully with similar storage representation.
Common access will include querying similar channel types across all loggers and joining across logger timestamps. For example, get channel1 from logger1, channel4 from logger2, and do a full outer join on logger1.time = logger2.time.
I should also mention that each logger timestamp is something that is subject to change due to time adjustment, and will be described in a different table showing the server's time reading, the logger's time reading, transmission latency, clock adjustment, and resulting adjusted clock value. This will happen for a set of logger records/timestamps depending on retrieval. This is my motivation for RecordTable below but otherwise isn't of much concern for now as long as I can reference a (logger, time, record) row from somewhere that will change the timestamps for associated data.
I have considered quite a few schema options, the most simple resembling a hybrid EAV approach where the table itself describes the attribute, since most attributes will just be a real value called "value". Here's a basic layout:
RecordTable              DataValueTable
-----------              --------------
[PK] id          <--     [FK] record_id
[FK] logger_id           [FK] channel_id
record_number            value
logger_time
Considering that logger_id, record_number, and logger_time are unique, I suppose I am making use of surrogate keys here but hopefully my justification of saving space is meaningful here. I have also considered adding a PK id to DataValueTable (rather than the PK being record_id and channel_id) in order to reference data values from other tables, but I am trying to resist the urge to make this model "too flexible" for now. I do, however, want to start getting data flowing soon and not have to change this part when extra features or differently-structured-data need to be added later.
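Spelled out as PostgreSQL DDL, that layout might look roughly like this (column types are guesses based on the description above):

CREATE TABLE record_table (
    id            BIGSERIAL PRIMARY KEY,
    logger_id     INT         NOT NULL,
    record_number INT         NOT NULL,
    logger_time   TIMESTAMPTZ NOT NULL,
    UNIQUE (logger_id, record_number, logger_time)
);

CREATE TABLE data_value_table (
    record_id  BIGINT NOT NULL REFERENCES record_table(id),
    channel_id INT    NOT NULL,
    value      REAL,                      -- float4, per the question
    PRIMARY KEY (record_id, channel_id)
);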
At first, I was creating record tables for each logger and then value tables for each channel and describing them elsewhere (in one place), with views to connect them all, but that just felt "wrong" because I was repeating the same thing so many times. I guess I'm trying to find a happy medium between too many tables and too many rows, but partitioning the bigger data (DataValueTable) seems strange because I'd most likely be partitioning on channel_id, so each partition would have the same value for every row. Also, partitioning in that regard would require a bit of work in re-defining the check conditions in the main table every time a channel is added. Partitioning by date is only applicable to the RecordTable, which isn't really necessary considering how relatively small it will be (7200 rows per day with the 5 loggers).
I also considered using the above with partial indexes on channel_id since DataValueTable will grow very large but the set of channel ids will remain small-ish, but I am really not certain that this will scale well after many years. I have done some basic testing with mock data and the performance is only so-so, and I want it to remain exceptional as data volume grows. Also, some express concern with vacuuming and analyzing a large table, and dealing with a large number of indexes (up to 250 in this case).
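For reference, a partial index in PostgreSQL is declared per channel, which is where the large index count comes from; channel 17 below is just a placeholder id:

-- Indexes only the slice of the big table belonging to one channel.
CREATE INDEX data_value_ch17_idx
    ON data_value_table (record_id)
    WHERE channel_id = 17;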
On a very small side note, I will also be tracking changes to this data and allowing for annotations (e.g. a bird crapped on the sensor, so these values were adjusted/marked etc), so keep that in the back of your mind when considering the design here but it is a separate concern for now.
Some background on my experience/technical level, if it helps to see where I'm coming from: I am a CS PhD student, and I work with data/databases on a regular basis as part of my research. However, my practical experience in designing a robust database for clients (this is part of a business) that has exceptional longevity and flexible data representation is somewhat limited. I think my main problem now is I am considering all the angles of approach to this problem instead of focusing on getting it done, and I don't see a "right" solution in front of me at all.
So, in conclusion, I guess these are my primary queries for you: if you've done something like this, what has worked for you? What are the benefits/drawbacks I'm not seeing of the various designs I've proposed here? How might you design something like this, given these parameters and access patterns?
I'll be happy to provide clarification/details where needed, and thanks in advance for being awesome.
It is no problem at all to provide all this in a Relational database. PostgreSQL is not enterprise class, but it is certainly one of the better freeware SQLs.
To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).
That is your biggest obstacle. Contrary to program design, which allows decomposition and isolated analysis/design of components, databases need to be designed as a single unit. Normalisation and other design techniques need to consider both the whole, and the component in context. The data, the descriptions, the metadata have to be evaluated together, not as separate parts.
Second, when you start off with surrogate keys, implying that you know the data, and how it relates to other data, it prevents you from genuine modelling of the data.
I have answered a very similar set of questions, coincidentally re very similar data. If you could read those answers first, it would save us both a lot of typing time on your question/answer.
Answer One/ID Obstacle
Answer Two/Main
Answer Three/Historical
I did something like this with seismic data for a petroleum exploration company.
My suggestion would be to store the meta-data in a database, and keep the sensor data in flat files, whatever that means for your computer's operating system.
You would have to write your own access routines if you want to modify the sensor data. Actually, you should never modify the sensor data. You should make a copy of the sensor data with the modifications so that you can show later what changes were made to the sensor data.

Designing tables for storing various requirements and stats for multiplayer game

Original Question:
Hello,
I am creating a very simple hobby project - a browser-based multiplayer game. I am stuck at designing tables for storing information about quest / skill requirements.
For now, I have designed my tables in the following way:
table user (basic information about users)
table stat (variety of stats)
table user_stats (connecting each user with stats)
Another example:
table monsters (basic information about npc enemies)
table monster_stats (connecting monsters with stats, using the same stat table from above)
Those were the simple cases. I must admit that I am stuck while designing requirements for different things, e.g. quests. Sample quest A might have only a minimum character level requirement (and that is easy to implement) - but another one, quest B, has a multitude of other reqs (finished quests, gained skills, possessing specific items, etc.) - what is a good way of designing tables for storing this kind of information?
In a similar manner - what is an efficient way of storing information about skill requirements? (specific character class, min level, etc).
I would be grateful for any help or information about creating database driven games.
Edit:
Thank you for the answers, yet I would like to receive more. As I am having some problems designing a rather complicated database layout for craftable items, I am starting a max bounty for this question.
I would like to receive links to articles / code snippets / anything connected with best practices for designing databases for storing game data (a good example of this kind of information is available on buildingbrowsergames.com).
I would be grateful for any help.
I'll edit this to add as many other pertinent issues as I can, although I wish the OP would address my comment above. I speak from several years as a professional online game developer and many more years as a hobbyist online game developer, for what it's worth.
Online games imply some sort of persistence, which means that you have broadly two types of data - one is designed by you, the other is created by the players in the course of play. Most likely you are going to store both in your database. Make sure you have different tables for these and cross-reference them properly via the usual database normalisation rules. (eg. If your player crafts a broadsword, you don't create an entire new row with all the properties of a sword. You create a new row in the player_items table with the per-instance properties, and refer to the broadsword row in the item_types table which holds the per-itemtype properties.) If you find a row of data is holding some things that you designed and some things that the player is changing during play, you need to normalise it out into two tables.
This is really the typical class/instance separation issue, and applies to many things in such games: a goblin instance doesn't need to store all the details of what it means to be a goblin (eg. green skin), only things pertinent to that instance (eg. location, current health). Some times there is a subtlety to the act of construction, in that instance data needs to be created based on class data. (Eg. setting a goblin instance's starting health based upon a goblin type's max health.) My advice is to hard-code these into your code that creates the instances and inserts the row for it. This information only changes rarely since there are few such values in practice. (Initial scores of depletable resources like health, stamina, mana... that's about it.)
Try and find a consistent terminology to separate instance data from type data - this will make life easier later when you're patching a live game and trying not to trash the hard work of your players by editing the wrong tables. This also makes caching a lot easier - you can typically cache your class/type data with impunity because it only ever changes when you, the designer, pushes new data up there. You can run it through memcached, or consider loading it all at start up time if your game has a continuous process (ie. is not PHP/ASP/CGI/etc), etc.
Remember that deleting anything from your design-side data is risky once you go live, since player-generated data may refer back to it. Test everything thoroughly locally before deploying to the live server because once it's up there, it's hard to take it down. Consider ways to be able to mark rows of such data as removed in a safe fashion - maybe a boolean 'live' column which, if set to false, means it just won't show up in the typical query. Think about the impact on players if you disable items they earned (and doubly if these are items they paid for).
The actual crafting side can't really be answered without knowing how you want to design your game. The database design must follow the game design. But I'll run through a trivial idea. Maybe you will want to be able to create a basic object and then augment it with runes or crystals or whatever. For that, you just need a one-to-many relationship between item instance and augmentation instance. (Remember, you might have item type and augmentation type tables too.) Each augmentation can specify a property of an item (eg. durability, max damage done in combat, weight) and a modifier (typically as a multiplier, eg. 1.1 to add a 10% bonus). You can see my explanation for how to implement these modifying effects here and here - the same principles apply for temporary skill and spell effects as apply for permanent item modification.
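A bare sketch of that one-to-many item/augmentation split in SQL - all names here are invented for illustration:

CREATE TABLE item_type (                 -- designer data: what a broadsword is
    item_type_id SERIAL PRIMARY KEY,
    name         TEXT NOT NULL
);

CREATE TABLE item_instance (             -- player data: this particular broadsword
    item_instance_id BIGSERIAL PRIMARY KEY,
    item_type_id     INT    NOT NULL REFERENCES item_type(item_type_id),
    owner_id         BIGINT NOT NULL
);

CREATE TABLE augmentation_instance (     -- many augmentations per item instance
    augmentation_id  BIGSERIAL PRIMARY KEY,
    item_instance_id BIGINT  NOT NULL REFERENCES item_instance(item_instance_id),
    property         TEXT    NOT NULL,   -- e.g. 'max_damage' or 'durability'
    modifier         NUMERIC NOT NULL    -- e.g. 1.10 for a 10% bonus
);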
For character stats in a database driven game, I would generally advise to stick with the naïve approach of one column (integer or float) per statistic. Adding columns later is not a difficult operation and since you're going to be reading these values a lot, you might not want to be performing joins on them all the time. However, if you really do need the flexibility, then your method is fine. This strongly resembles the skill level table I suggest below: lots of game data can be modelled in this way - map a class or instance of one thing to a class or instance of other things, often with some additional data to describe the mapping (in this case, the value of the statistic).
Once you have these basic joins set up - and indeed any other complex queries that result from the separation of class/instance data in a way that may not be convenient for your code - consider creating a view or a stored procedure to perform them behind the scenes so that your application code doesn't have to worry about it any more.
Other good database practices apply, of course - use transactions when you need to ensure multiple actions happen atomically (eg. trading), put indices on the fields you search most often, use VACUUM/OPTIMIZE TABLE/whatever during quiet periods to keep performance up, etc.
(Original answer below this point.)
To be honest I wouldn't store the quest requirement information in the relational database, but in some sort of script. Ultimately your idea of a 'requirement' takes on several varying forms which could draw on different sorts of data (eg. level, class, prior quests completed, item possession) and operators (a level might be a minimum or a maximum, some quests may require an item whereas others may require its absence, etc) not to mention a combination of conjunctions and disjunctions (some quests require all requirements to be met, whereas others may only require 1 of several to be met). This sort of thing is much more easily specified in an imperative language. That's not to say you don't have a quest table in the DB, just that you don't try and encode the sometimes arbitrary requirements into the schema. I'd have a requirement_script_id column to reference an external script. I suppose you could put the actual script into the DB as a text field if it suits, too.
Skill requirements are suited to the DB though, and quite trivial given the typical game system of learning skills as you progress through levels in a certain class:
table skill_levels
{
    int skill_id FOREIGN KEY;
    int class_id FOREIGN KEY;
    int min_level;
}
myPotentialSkillList = SELECT * FROM skill_levels INNER JOIN
                       skill ON skill_levels.skill_id = skill.id
                       WHERE class_id = my_class_id
                       ORDER BY skill_levels.min_level ASC;
Need a skill tree? Add a column prerequisite_skill_id. And so on.
Update:
Judging by the comments, it looks like a lot of people have a problem with XML. I know it's cool to bash it now and it does have its problems, but in this case I think it works. One of the other reasons that I chose it is that there are a ton of libraries for parsing it, so that can make life easier.
The other key concept is that the information is really non-relational. So yes, you could store the data in any particular example in a bunch of different tables with lots of joins, but that's a pain. And if I kept giving you slightly different examples, I bet you'd have to modify your design ad infinitum. I don't think adding tables and modifying complicated SQL statements is very much fun. So it's a little frustrating that #scheibk's comment has been voted up.
Original Post:
I think the problem you might have with storing quest information in the database is that it isn't really relational (that is, it doesn't really fit easily into a table). That might be why you're having trouble designing tables for the data.
On the other hand, if you put your quest information directly into code, that means you'll have to edit the code and recompile each time you want to add a quest. Lame.
So if I were you, I might consider storing my quest information in an XML file or something similar. I know that's the generic solution for just about anything, but in this case it sounds right to me. XML is really made for storing non-relational and/or hierarchical data, just like the stuff you need to store for your quest.
Summary: You could come up with your own schema, create your XML file, and then load it at run time somehow (or even store the XML in the database).
Example XML:
<quests>
  <quest name="Return Ring to Mordor">
    <characterReqs>
      <level>60</level>
      <finishedQuests>
        <quest name="Get Double Cheeseburger" />
        <quest name="Go to Vegas for the Weekend" />
      </finishedQuests>
      <skills>
        <skill name="nunchuks" />
        <skill name="plundering" />
      </skills>
      <items>
        <item name="genie's lamp" />
        <item name="noise cancelling headphones for robin williams' voice" />
      </items>
    </characterReqs>
    <steps>
      <step number="1">Get to Mordor</step>
      <step number="2">Throw Ring into Lava</step>
      <step number="3">...</step>
      <step number="4">Profit</step>
    </steps>
  </quest>
</quests>
It sounds like you're ready for general object oriented design (OOD) principles. I'm going to purposefully ignore the context (gaming, MMO, etc) because that really doesn't matter to how you do a design process. And me giving you links is less useful than explaining what terms will be most helpful to look up yourself, IMO; I'll put those in bold.
In OOD, the database schema comes directly from your system design, not the other way around. Your design will tell you what your base object classes are and which properties can live in the same table (the ones in a 1:1 relationship with the object) versus which to make mapping tables for (anything with 1:n or n:m relationships - for example, one user has multiple stats, so it's 1:n). In fact, if you do the OOD correctly, you will have zero decisions to make regarding the final DB layout.
The "correct" way to do any OO mapping is learned as a multi-step process called "Database Normalization". The basics are just as I described: find the "arity" of the object relationships (1:1, 1:n, ...) and make mapping tables for the 1:n's and n:m's. For 1:n's you end up with two tables, the "base" table and a "base_subobjects" table (eg. your "users" and "user_stats" is a good example) with the "foreign key" (the Id of the base object) as a column in the subobject mapping table. For n:m's, you end up with three tables: "base", "subobjects", and "base_subobjects_map", where the map has one column for the base Id and one for the subobject Id. This might be necessary in your example for N quests that can each have M requirements (so the requirement conditions can be shared among quests).
That's 85% of what you need to know. The rest is how to handle inheritance, which I advise you to just skip unless you're masochistic. Now just go figure out how you want it to work before you start coding stuff up and the rest is cake.
The thread in #Shea Daniel's answer is on the right track: the specification for a quest is non-relational, and also includes logic as well as data.
Using XML or Lua are examples, but the more general idea is to develop your own Domain-Specific Language to encode quests. Here are a few articles about this concept, related to game design:
The Whimsy Of Domain-Specific Languages
Using a Domain Specific Language for Behaviors
Using Domain-Specific Modeling towards Computer Games Development Industrialization
You can store the block of code for a given quest into a TEXT field in your database, but you won't have much flexibility to use SQL to query specific parts of it. For instance, given the skills a character currently has, which quests are open to him? This won't be easy to query in SQL, if the quest prerequisites are encoded in your DSL in a TEXT field.
You can try to encode individual prerequisites in a relational manner, but it quickly gets out of hand. Relational and object-oriented just don't go well together. You can try to model it this way:
Chars <--- CharAttributes --> AllAttributes <-- QuestPrereqs --> Quests
And then do a LEFT JOIN looking for any quests for which no prereqs are missing from the character's attributes. Here's pseudo-code (the attribute_id and char_id columns are assumed names):
SELECT qp.quest_id
FROM QuestPrereqs qp
JOIN AllAttributes a       ON a.attribute_id = qp.attribute_id
LEFT JOIN CharAttributes c ON c.attribute_id = a.attribute_id
                          AND c.char_id = :char_id
GROUP BY qp.quest_id
HAVING COUNT(a.attribute_id) = COUNT(c.attribute_id);
But the problem with this is that now you have to model every aspect of your character that could be a prerequisite (stats, skills, level, possessions, quests completed) as some kind of abstract "Attribute" that fits into this structure.
This solves this problem of tracking quest prerequisites, but it leaves you with another problem: the character is modeled in a non-relational way, essentially an Entity-Attribute-Value architecture which breaks a bunch of relational rules and makes other types of queries incredibly difficult.
Not directly related to the design of your database, but a similar question was asked a few weeks back about class diagram examples for an RPG
I'm sure you can find something useful in there :)
Regarding your basic structure, you may (depending on the nature of your game) want to consider driving toward convergence of representation between player character and non-player characters, so that code that would naturally operate the same on either doesn't have to worry about the distinction. This would suggest, instead of having user and monster tables, having a character table that represents everything PCs and NPCs have in common, and then a user table for information unique to PCs and/or user accounts. The user table would have a character_id foreign key, and you could tell a player character row by the fact that a user row exists corresponding to it.
For representing quests in a model like yours, the way I would do it would look like:
quest_model
===============
id
name ['Quest for the Holy Grail', 'You Killed My Father', etc.]
etc.
quest_model_req_type
===============
id
name ['Minimum Level', 'Skill', 'Equipment', etc.]
etc.
quest_model_req
===============
id
quest_model_id
quest_model_req_type_id
value [10 (for Minimum Level), 'Horseback Riding' (for Skill), etc.]
quest
===============
id
quest_model_id
user_id
status
etc.
So a quest_model is the core definition of the quest structure; each quest_model can have 0..n associated quest_model_req rows, which are requirements specific to that quest model. Every quest_model_req is associated with a quest_model_req_type, which defines the general type of requirement: achieving a Minimum Level, having a Skill, possessing a piece of Equipment, and so on. The quest_model_req also has a value, which configures the requirement for this specific quest; for example, a Minimum Level type requirement might have a value of 20, meaning you must be at least level 20.
The quest table, then, is individual instances of quests that players are undertaking or have undertaken. The quest is associated with a quest_model and a user (or perhaps character, if you ever want NPCs to be able to do quests!), and has a status indicating where the progress of the quest stands, and whatever other tracking turns out useful.
This is a bare-bones structure that would, of course, have to be built out to accomodate the needs of particular games, but it should illustrate the direction I'd recommend.
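Rendered as DDL under the same assumptions (types are guesses; value is kept as text so it can hold either a number or a skill name):

CREATE TABLE quest_model          (id SERIAL PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE quest_model_req_type (id SERIAL PRIMARY KEY, name TEXT NOT NULL);

CREATE TABLE quest_model_req (
    id                      SERIAL PRIMARY KEY,
    quest_model_id          INT  NOT NULL REFERENCES quest_model(id),
    quest_model_req_type_id INT  NOT NULL REFERENCES quest_model_req_type(id),
    value                   TEXT NOT NULL   -- '20' for Minimum Level, 'Horseback Riding' for Skill, ...
);

CREATE TABLE quest (
    id             SERIAL PRIMARY KEY,
    quest_model_id INT  NOT NULL REFERENCES quest_model(id),
    user_id        INT  NOT NULL,
    status         TEXT NOT NULL             -- e.g. 'in_progress', 'completed'
);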
Oh, and since someone else threw around their credentials, mine are that I've been a hobbyist game developer on live, public-facing projects for 16 years now.
I'd be extremely careful of what you actually store in a DB, especially for an MMORPG. Keep in mind, these things are designed to be MASSIVE with thousands of users, and game code has to execute excessively quickly and send a crap-ton of data over the network, not only to the players on their home connections but also between servers on the back-end. You're also going to have to scale out eventually and databases and scaling out are not two things that I feel mix particularly well, particularly when you start sharding into different regions and then adding instance servers to your shards and so on. You end up with a whole lot of servers talking to databases and passing a lot of data, some of which isn't even relevant to the game at all (SQL text going to a SQL server is useless network traffic that you should cut down on).
Here's a suggestion: Limit your SQL database to storing only things that will change as players play the game. Monsters and monster stats will not change. Items and item stats will not change. Quest goals will not change. Don't store these things in a SQL database, instead store them in the code somewhere.
Doing this means that every server that ever lives will always know all of this information without ever having to query a database. Now, you don't store quests at all; you just store the accomplishments of the player, and the game programmatically determines the effects of those quests being completed. You don't waste data transferring information between servers because you're only sending event IDs or something of that nature (you can optimize the data you pass by using just enough bits to represent all the event IDs, and this will cut down on network traffic. May seem insignificant, but nothing is insignificant in massive network apps).
Do the same thing for monster stats and item stats. These things don't change during gameplay so there's no need to keep them in a DB at all and therefore this information NEVER needs to travel over the network. The only thing you store is the ID of the items or monster kills or anything like that which is non-deterministic (i.e. it can change during gameplay in a way which you can't predict). You can have dedicated item servers or monster stat servers or something like that and you can add those to your shards if you end up having huge numbers of these things that occupy too much memory, then just pass the data that's necessary for a particular quest or area to the instance server that is handling that thing to cut down further on space, but keep in mind that this will up the amount of data you need to pass down the network to spool up a new instance server so it's a trade-off. As long as you're aware of the consequences of this trade-off, you can use good judgement and decide what you want to do. Another possibility is to limit instance servers to a particular quest/region/event/whatever and only equip it with enough information to the thing it's responsible for, but this is more complex and potentially limits your scaling out since resource allocation will become static instead of dynamic (if you have 50 servers of each quest and suddenly everyone goes on the same quest, you'll have 49 idle servers and one really swamped server). Again, it's a trade-off so be sure you understand it and make good choices for your application.
Once you've identified exactly what information in your game is non-deterministic, then you can design a database around that information. That becomes a bit easier: players have stats, players have items, players have skills, players have accomplishments, etc, all fairly easy to map out. You don't need descriptions for things like skills, accomplishments, items, etc, or even their effects or names or anything since the server can determine all that stuff for you from the ID's of those things at runtime without needing a database query.
Now, a lot of this probably sounds like overkill to you. After all, a good database can do queries very rapidly. However, your bandwidth is extremely precious, even in the data center, so you need to limit your use of it to only what is absolutely necessary to send and only send that data when it's absolutely necessary that it be sent.
Now, for representing quests in code, I would consider the specification pattern (http://en.wikipedia.org/wiki/Specification_pattern). This will allow you to easily build up quest goals in terms of what events are needed to ensure that the specification for completing that quest is met. You can then use LUA (or something) to define your quests as you build the game so that you don't have to make massive code changes and rebuild the whole damn thing to make it so that you have to kill 11 monsters instead of 10 to get the Sword of 1000 truths in a particular quest. How to actually do something like that I think is beyond the scope of this answer and starts to hit the edge of my knowledge of game programming so maybe someone else on here can help you out if you choose to go that route.
Also, I know I used a lot of terms in this answer, please ask if there are any that you are unfamiliar with and I can explain them.
Edit: I didn't notice your addition about craftable items. I'm going to assume that these are things a player can create specifically in the game, like custom items. If a player can continually change these items, then you can just combine the attributes of what they're crafted from at runtime, but you'll need to store the ID of each attribute in the DB somewhere. If you make a finite number of things you can add on (like gems in Diablo II) then you can eliminate a join by just adding that number of columns to the table. If there is a finite number of items that can be crafted and a finite number of ways that different things can be joined together into new items, then when certain items are combined, you needn't store the combined attributes; it just becomes a new item which has been defined at some point by you already. Then, they just have that item instead of its components. If you clarify the behavior your game is to have, I can add additional suggestions if that would be useful.
I would approach this from an Object Oriented point of view, rather than a Data Centric point of view. It looks like you might have quite a lot of (poss complex) objects - I would recommend getting them modeled (with their relationships) first, and relying on an ORM for persistence.
When you have a data-centric problem, the database is your friend. What you have done so far seems to be quite right.
On the other hand, the other problems you mention seem to be behaviour-centric. In this case, an object-oriented analysis and solution will work better.
For example:
Create a quest class with specificQuest child classes. Each child should implement a bool HasRequirements(Player player) method.
Another option is some sort of rules engine (Drools, for example if you are using Java).
If I were designing a database for such a situation, I might do something like this:
Quest
    [quest properties like name and description]
    reqItemsID
    reqSkillsID
    reqPlayerTypesID

RequiredItems
    ID
    item

RequiredSkills
    ID
    skill

RequiredPlayerTypes
    ID
    type
In this, the IDs map to the respective tables; you then retrieve all entries under that ID to get the list of required items, skills, what have you. If you allow dynamic creation of items, then you should have a mapping to another table that contains all possible items.
Another thing to keep in mind is normalization. There's a long article here, but I've condensed the first three levels into the following, more or less:
first normal form means that there are no database entries where a specific field has more than one item in it
second normal form means that if you have a composite primary key all other fields are fully dependent on the entire key not just parts of it in each table
third normal form is where you have no non-key fields that are dependent on other non-key fields in any table
[Disclaimer: I have very little experience with SQL databases and am new to this field. I just hope I'm of help.]
I've done something sort of similar, and my general solution was to use a lot of metadata. I'm using the term loosely to mean that any time I needed new data to make a given decision (allow a quest, allow using an item, etc.) I would create a new attribute. This was basically just a table with an arbitrary number of values and descriptions. Then each character would have a list of these types of attributes.
Ex: List of Kills, Level, Regions visited, etc.
The two things this does to your dev process are:
1) Every time there's an event in the game, you need to have a big old switch block that checks all these attribute types to see if something needs updating.
2) Every time you need some data, check all your attribute tables BEFORE you add a new one.
I found this to be a good rapid development strategy for a game that grows organically (not completely planned out on paper ahead of time) - but its one big limitation is that your past/current content (levels/events etc.) will not be compatible with future attributes - i.e. that map won't give you a region badge because there were no region badges when you coded it. This of course requires you to update past content when new attributes are added to the system.
Just some little points for your consideration:
1) Always try to make your "get quest" requirements simple and your "finish quest" requirements complicated.
Part 1 can be done by trying to make your quests hierarchical:
Example:
QuestA: (Kill Raven the demon) (quest req: Lvl 1)
QuestA.1: Save "unknown" in the forest to obtain some info... (quest req: QuestA)
QuestA.2: Craft the sword of Crystal... etc. (quest req: QuestA.1 == Done)
QuestA.3: ... etc. (quest req: QuestA.2 == Done)
QuestA.4: ... etc. (quest req: QuestA.3 == Done)
etc...
QuestB (Find the lost tomb) (quest req: QuestA.status == Done)
QuestC (Go to the demons' Hypermarket) (quest req: QuestA.status == Done && player.level == 10)
etc...
Doing this will save you lots of data fields/table joins.
ADDITIONAL THOUGHTS:
If you use the above system, you can add an extra reward field to your quest table called "enableQuests" and add the names of the quests that need to be enabled.
Logically, you'd have an "enabled" field assigned to each quest.
2) A minor solution for your crafting problem: create crafting recipes - items that contain the crafting requirements of the to-be-crafted item stored in them.
So when a player tries to craft an item, he first needs to buy a recipe, then try crafting.
A simple example of such an item description would be:
ItemName: "Legendary Sword of the dead"
Craft level req.: 75
Items required:
Item_1: Blade of the dead
Item_2: A cursed seal
Item_3: Holy Gemstone of the dead
etc...
And when he presses the "craft" action, you can parse it and compare against his inventory/craft box.
So your crafting DB will have only one field (or two if you want to add a crafting level requirement, though it will already be included in the recipe).
ADDITIONAL THOUGHTS:
Such items can be stored in XML format in the table, which would make them much easier to parse.
3) A similar XML system can be applied to your quest system to implement quest-ending requirements.

What is the "best" way to store international addresses in a database?

What is the "best" way to store international addresses in a database? Answer in the form of a schema and an explanation of the reasons why you chose to normalize (or not) the way you did. Also explain why you chose the type and length of each field.
Note: You decide what fields you think are necessary.
Plain freeform text.
Validating all the world's post/zip codes is too hard; a fixed list of countries is too politically sensitive; mandatory state/region/other administrative subdivision is just plain inappropriate (all too often I'm asked which county I live in--when I don't, because Greater London is not a county at all).
More to the point, it's simply unnecessary. Your application is highly unlikely to be modelling addresses in any serious way. If you want a postal address, ask for the postal address. Most people aren't so stupid as to put in something other than a postal address, and if they do, they can kiss their newly purchased item bye-bye.
The exception to this is if you're doing something that's naturally constrained to one country anyway. In this situation, you should ask for, say, the { postcode, house number } pair, which is enough to identify a postal address. I imagine you could achieve similar things with the extended zip code in the US.
In the past I've modeled forms that needed to be international after the UPS/FedEx shipping address forms on their websites (I figured if they don't know how to handle an international order, we are all hosed). The fields they use can be used as a reference for setting up your schema.
In general, you need to understand why you want an address. Is it for shipping/mailing? Then there is really only one requirement: keep the country separate. The other lines are freeform, to be filled in by the user. The reason for this is the common forwarding strategy for mail: any incoming mail for a foreign country is forwarded without looking at the other address lines. Hence, the detailed information is parsed only by the mail sorter located in the destination country itself. Like the receiver, they'll be familiar with national conventions.
(UPS may bunch together some small European countries, e.g. all the Low Countries are probably served from Belgium - the idea still holds.)
I think adding country/city plus an address text field will be fine. Country and city should be separate for reporting. Managers always ask for these kinds of reports which you do not expect, and I'd rather not run a LIKE query through a large database.
Not to give Facebook undue respect. However, the overall structure of the database seems to be overlooked in many web applications launching every day. Obviously I don't think there is a perfect solution that covers all the potential variables with address structure without some hard work. That said, combined with autocomplete Facebook manages to take location input data and eliminate a majority of their redundant entries. They do this by organizing their database well enough to provide autocomplete information in a low cost, low error way to the client in real time allowing them to more or less choose the correct location from an existing list.
I think the best solution is to access a third-party database which contains your desired geographic scope and use it to initially seed your user location information. This will allow you to avoid doing the groundwork of creating your own. With any luck you can reduce the load on your server by allowing your new users to receive the correct autocomplete information directly from your third-party supplier. Eventually you will be able to fill most autocomplete for location information, such as city, country, etc., from information contained in your own database from user input data.
You need to provide a bit more details about how you are planning to use the data. For example, fields like City, State, Country can either be text in the single table, or be codes which are linked to a separate table with a Foreign Key.
Simplest would be
Address_Line_01 (Required, Non blank)
Address_Line_02
Address_Line_03
Landmark
City (Required)
Pin (Required)
Province_District
State (Required)
Country (Required)
All the above can be Text/Unicode with appropriate field lengths.
Phone Numbers as applicable.
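As a sketch, that list maps onto a table something like this (field lengths are arbitrary placeholders; adjust to your data):

CREATE TABLE address (
    address_id        SERIAL PRIMARY KEY,
    address_line_01   VARCHAR(100) NOT NULL CHECK (address_line_01 <> ''),
    address_line_02   VARCHAR(100),
    address_line_03   VARCHAR(100),
    landmark          VARCHAR(100),
    city              VARCHAR(80)  NOT NULL,
    pin               VARCHAR(20)  NOT NULL,   -- postal/zip code, stored as text
    province_district VARCHAR(80),
    state             VARCHAR(80)  NOT NULL,
    country           VARCHAR(80)  NOT NULL,
    phone             VARCHAR(40)
);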

Resources