historical data modelling literature, methods and techniques - database

Last year we launched http://tweetMp.org.au - a site dedicated to Australian politics and twitter.
Late last year our politician schema needed to be adjusted because some politicians retired and new politicians came in.
Changing our db required a manual (SQL) change, so I was considering implementing a CMS for our admins to make these changes in the future.
There are also many other government/politics sites out there for Australia that manage their own politician data.
I'd like to come up with a centralized way of doing this.
After thinking about it for a while, maybe the best approach is not to model the current view of the politician data and how it relates to the political system, but to model the transactions instead, such that the current view is a projection of all the transactions/changes that have happened in the past.
Using this approach, other sites could "subscribe" to changes (à la PubSubHubbub), submit changes, and just integrate these change items into their schemas.
Without this approach, most sites would have to tear down the entire db, and repopulate it, so any associated records would need to be reassociated. Managing data this way is pretty annoying, and severely impedes mashups of this data for the public good.
I've noticed some things work this way - source version control, banking records, the Stack Overflow points system and many other examples.
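To make the idea concrete, here is a minimal sketch of the kind of thing I have in mind (Postgres-flavoured SQL; table and column names are purely illustrative):

-- Append-only log of changes; the current view is derived from it
CREATE TABLE politician_event (
    event_id      BIGSERIAL PRIMARY KEY,   -- gives the log a total order
    politician_id INT NOT NULL,
    event_type    VARCHAR(20) NOT NULL,    -- e.g. 'ELECTED', 'RETIRED', 'PORTFOLIO_CHANGE'
    payload       TEXT,                    -- details of the change
    effective_at  TIMESTAMP NOT NULL
);

-- The "current view" is then a projection over the log,
-- e.g. politicians with events who have never retired:
SELECT politician_id
FROM politician_event
GROUP BY politician_id
HAVING SUM(CASE WHEN event_type = 'RETIRED' THEN 1 ELSE 0 END) = 0;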
Of course, the immediate challenges and design issues with this approach include:
Is the current view cached and re-persisted? How often is it updated?
What base entities must exist that never change?
Probably heaps more I can't think of right now...
Is there any notable literature on this subject that anyone could recommend?
Also, any patterns or practices for data modelling like this that could be useful?
Any help is greatly appreciated.
-CV

This is a fairly common problem in data modelling. Basically it comes down to this:
Are you interested in the view now, the view at a point in time, or both?
For example, if you have a service that models subscriptions you need to know:
What services someone had at a point in time: this is needed to work out how much to charge, to see a history of the account and so forth; and
What services someone has now: what can they access on the Website?
The starting point for this kind of problem is to have a history table, such as:
Service history: id, userid, serviceid, start_date, end_date
Chain together the service histories for a user and you have their history. So how do you model what they have now? The easiest (and most denormalized) view is to say that the last record, or the record with a NULL end date or a present/future end date, is what they have now.
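For example, the "what do they have now" query might look like this (a sketch against the history table above; exact syntax will vary):

-- Current services for a given user: open-ended or not-yet-expired rows
SELECT serviceid
FROM service_history
WHERE userid = 42   -- illustrative user
  AND start_date <= CURRENT_DATE
  AND (end_date IS NULL OR end_date >= CURRENT_DATE);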
As you can imagine this can lead to some gnarly SQL, so this is selectively denormalized: you have a Services table and another table for history. Each time Services is changed, a history record is created or updated. This approach makes the history table more of an audit table (another term you'll see bandied about).
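A sketch of that selectively denormalized pair (names and types are illustrative, Postgres-flavoured):

-- Current state: one row per active service
CREATE TABLE services (
    userid     INT NOT NULL,
    serviceid  INT NOT NULL,
    start_date DATE NOT NULL,
    PRIMARY KEY (userid, serviceid)
);

-- Audit/history: every change to services is recorded here as well
CREATE TABLE service_history (
    id         SERIAL PRIMARY KEY,
    userid     INT NOT NULL,
    serviceid  INT NOT NULL,
    start_date DATE NOT NULL,
    end_date   DATE NULL   -- NULL while the service is still active
);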
This is analogous to your problem. You need to know:
Who is the current MP for each seat in the House of Representatives;
Who is the current Senator for each seat;
Who is the current Minister for each department;
Who is the Prime Minister.
But you also need to know who was each of those things at a point in time so you need a history for all those things.
So if, on 20 August 2003, Peter Costello made a press release, you would need to know that at that time he was:
The Member for Higgins;
The Treasurer; and
The Deputy Prime Minister.
because conceivably someone could be interested in finding all press releases by Peter Costello or by the Treasurer, which should lead to the same press release but would be impossible to trace without the history.
Additionally you might need to know which seats are in which states, possibly the geographical boundaries and so on.
None of this should require a schema change as the schema should be able to handle it.
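Applied to your data, the same history-table pattern might look like this (a sketch only; the table names are mine, not a prescribed schema):

-- Who held which seat, and when
CREATE TABLE seat_history (
    id            SERIAL PRIMARY KEY,
    seat_id       INT NOT NULL,
    politician_id INT NOT NULL,
    start_date    DATE NOT NULL,
    end_date      DATE NULL   -- NULL = current member
);

-- Who held which office (Treasurer, Prime Minister, ...), and when
CREATE TABLE office_history (
    id            SERIAL PRIMARY KEY,
    office_id     INT NOT NULL,
    politician_id INT NOT NULL,
    start_date    DATE NOT NULL,
    end_date      DATE NULL
);

-- e.g. every office a given politician held on 20 August 2003
SELECT office_id
FROM office_history
WHERE politician_id = 123   -- illustrative id
  AND start_date <= DATE '2003-08-20'
  AND (end_date IS NULL OR end_date >= DATE '2003-08-20');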

Related

Customer Deduplication in Booking Application

We have a booking system where tens of thousands of reservations are made every day. Because a customer can create a reservation without being logged in, a new customer id/row is created for every reservation, even if the very same customer has reserved in the system before. That results in a lot of customer duplicates.
The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script which checks for these duplicates based on some business rules (email, address, etc.). The deduplication logic is then:
If a new reservation is created, check whether the (newly created) customer for this reservation already has an old customer id (by comparing email and other aspects).
If it has one or more old reservations, detach those reservations from the old customer id and link them to the new customer id - literally by changing the customer ID of the old reservations to the newly created customer.
I don't have a very strong technical background, but to me this smells like terrible design. As we have several operational applications relying on that data, this creates a massive sync issue. Besides that, I was hoping to understand why exactly, in terms of application architecture, this is bad design, and what a better solution for this deduplication problem would be (if it even has to be solved in "this" application domain).
I would very much appreciate any help so I can steer the engineering team in the right direction.
In General
What's the problem you're trying to solve? Freeing up disk space, getting accurate analytics of user behavior, or being more user friendly?
It feels a bit risky, and it depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse?" - not because you should be paranoid, but because not thinking that through feels a bit negligent. E.g. if you were a government department matching private citizen records, that approach would be way too cavalier.
If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.
If there's no process for validating the identity of the user, then by definition your customer id/row is storing sessions, not Customers.
In terms of the nightly job: if your backend is an old legacy system, I can appreciate why a nightly batch job might be the easiest option; that said, done correctly and with the right architecture, you should be able to do that check on the fly, as needed.
Specifics
...check if the (newly created) customer for this reservation has already an old customer id (by comparing email...
Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.
... and other aspects.
What are those? Sometimes getting more data just makes it harder, unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone makes a typo in their phone number which matches some other customer - so you simultaneously match with more than one customer?
If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.
Feels dangerous. What happens if the detaching process screws up? I've seen situations where, instead of updating the delta, the system did a total purge and then a full re-import... when the second part fails, the entire system is blank. It's not your exact situation, but you are creating the possibility for similar types of issue.
As we have several operational applications relying on that data, this creates a massive sync issue.
...case in point.
In your case, doing the swap in a transaction would be wise. You may also want to track all customer ID swaps so that you can revert if something goes wrong.
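A minimal sketch of that, assuming hypothetical reservation and customer_id_swap tables (the ids are placeholders):

BEGIN;

-- Record the swap first, so it can be reverted if something goes wrong
INSERT INTO customer_id_swap (reservation_id, old_customer_id, new_customer_id, swapped_at)
SELECT id, customer_id, 9876, CURRENT_TIMESTAMP
FROM reservation
WHERE customer_id = 1234;

-- Re-point the old reservations at the surviving customer id
UPDATE reservation
SET customer_id = 9876
WHERE customer_id = 1234;

COMMIT;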
Option - Phased Introduction Based on Testing
You could try this:
Keep the system as-is for now.
Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records, just make a copy of what the new data would be. Do this in production - you'll get a much better sample of data.
Run extensive tests over the trial data, looking for instances where you got it wrong. What you could consider building here is a "scoring" algorithm (see the sketch after this list): if you are checking more than one piece of data, you'll get different combinations with different likelihoods of accuracy, and you can use this to gauge how good your matching is. You can then decide in which circumstances it's safe to do the ID switch and when it's not.
Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.
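For illustration, the scoring query might look something like this (weights, columns and ids are entirely made up; tune them against your trial data):

-- Score every existing customer as a candidate match for a new customer
SELECT old_c.id AS candidate_customer_id,
       (CASE WHEN old_c.email    = new_c.email    THEN 60 ELSE 0 END
      + CASE WHEN old_c.phone    = new_c.phone    THEN 25 ELSE 0 END
      + CASE WHEN old_c.postcode = new_c.postcode THEN 15 ELSE 0 END) AS match_score
FROM customer old_c
CROSS JOIN customer new_c
WHERE new_c.id = 9876          -- the newly created customer
  AND old_c.id <> new_c.id
ORDER BY match_score DESC;

In practice you would restrict the candidate set with indexes rather than scanning every customer, but as a scoring harness over trial data this is enough to measure match quality.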
Alternative Customer/Session Approach
Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
Allow users to optionally be validated as "Customers" (big C).
Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.
I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.
I wouldn't say that's a terrible design; it's just a simple approach to solving this particular problem, with some room for improvement. It's not optimal because the runtime of that job depends on the number of new bookings received during the day, which may vary from day to day, so other workflows that depend on it will be impacted.
This approach can be improved by processing new bookings in parallel, and by using an index to get a fast lookup when checking whether an e-mail already exists.
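For example, the lookup side could be as simple as this (a Postgres-style expression index; table and index names assumed):

-- Case-insensitive index so the email check is a fast seek, not a scan
CREATE INDEX idx_customer_email ON customer (lower(email));

SELECT id FROM customer WHERE lower(email) = lower('jane@example.com');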
You can also check out Bloom Filters - an efficient data structure that is able to tell you if an element is not in a given set.
The way I would do it is to store the bookings in a NoSQL DB table keyed off the user email. You get the user email in both situations - when the user has an account and when they make a booking without one - so you just have to do a lookup to get the bookings by email, which makes the deduplication job redundant.

memgraphdb: Support for time travel queries in graph databases

Let's say I want to model a graph of salespeople. They belong to an organisation, have a manager, etc. They are assigned to specific territories and/or client accounts. The company may work with external partners, which must be managed, and so on. A nice, non-trivial graph.
Elements in this graph keep changing all the time: salespeople come and go, or move within the organisation and thus change responsibilities; customers sign contracts or cancel them; ...
In my specific use cases, the point in time is very important. What did the graph look like at the end of last month? At the end of the last fiscal year? Last Monday, when we ran job ABC? E.g. what was the manager hierarchy at the end of last month? Which clients did the salesperson manage at the end of last month? And so on.
In our use cases, DELETE doesn't delete anything; instead, some sort of end_date gets updated. UPDATE doesn't update anything; instead, a new version of the record is created.
I'm sure I can add CREATED and START-/END_DATE properties to nodes as well as relations, and of course I can also write the queries. But these queries are a pain to write and almost unreadable, with tons of repeating where clauses everywhere.
I wish graph databases (and their graphical query builders) would let me travel in time more easily, e.g. by setting a session variable to a point in time so that the where clauses are automatically added for all nodes and relations that have the start/end date properties. The algorithm should not fail for objects that don't have these properties, but consider the condition met.
What are your thoughts about this use case, and what help does Memgraph provide for it?
Thanks a lot
Juergen
As far as I am aware, there is no graph database that directly supports the type of functionality you are asking about, although as @buda points out you can model and query against time-series data. I agree with @buda that the way in which you would like this to work seems a bit undefined and very application specific, so I would not expect it to be a feature of any database.
The closest thing I can think of to out-of-the-box support for something like this would be to use a TinkerPop-enabled database with a PartitionStrategy or SubgraphStrategy to create a subgraph of only the times you wanted, and then query against that. Another option would be creating a domain-specific language to minimize the amount of repeated code in your queries.
PartitionStrategy
SubgraphStrategy
Domain Specific Languages

Database schema naming conventions and common mistakes?

I'm asked to design a relational database to keep data to answer clinic operation queries such as:
● List the patient appointments for each doctor for a given date.
● When a patient rings to make an appointment, give the available time slots for a given date.
● Retrieve the address of patients to send notices via mail services.
I have a schema of one relation, as shown below, but I was wondering whether there are any mistakes I've made?
ABC(doc-name, doc-gender, registration_num, qualification, pat-name, pat-gender, DOB, address, phone-num, appoint-date, appoint-time, type)
Is the use of words such as date and the use of hyphens generally discouraged? Are there any other weaknesses in my design?
Thank you
So, that's not a schema or a design - not for a relational database, which, based on the tags for the question, is what you're looking for. That's the storage definition for an ID/value style of database. If you're looking for actual relational storage, you should be building out those relationships through the process of normalization.
For example, let's start at the beginning with doc-name (I'm personally not crazy about using hyphens, but it's not a showstopper; just be sure whichever RDBMS you're working with supports them in names and you're good to go). If we think about this just from a data entry standpoint, we don't want to have to type in the name of the doctor every time we use that doctor. Instead, we'd want to pull it from a list. So, clearly, we can break that apart from the rest of the information. There is the beginning of our normalization process. We can also easily note that a patient is likely to have more than one appointment. Under the current structure, we'd have to re-enter every bit of patient information for each appointment. There's another place where we'd break this apart.
There is tons more to this simple example that could be split out and normalized.
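As a rough illustration of where that process leads (a sketch only; names and types are mine, not a finished design):

CREATE TABLE doctor (
    doctor_id        INT PRIMARY KEY,
    doc_name         VARCHAR(100) NOT NULL,
    doc_gender       CHAR(1),
    registration_num VARCHAR(20) NOT NULL,
    qualification    VARCHAR(100)
);

CREATE TABLE patient (
    patient_id INT PRIMARY KEY,
    pat_name   VARCHAR(100) NOT NULL,
    pat_gender CHAR(1),
    dob        DATE,
    address    VARCHAR(200),
    phone_num  VARCHAR(20)
);

CREATE TABLE appointment (
    appointment_id INT PRIMARY KEY,
    doctor_id      INT NOT NULL REFERENCES doctor (doctor_id),
    patient_id     INT NOT NULL REFERENCES patient (patient_id),
    appoint_date   DATE NOT NULL,
    appoint_time   TIME NOT NULL,
    type           VARCHAR(30)
);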
I'd suggest you read up on data normalization. My favorite teacher on the subject is Louis Davidson. Here's his book on the topic. Read that and then try to readdress the situation you're facing.
I'm assuming this isn't just homework. If it is, as it currently stands, I'd give you an "F". If it isn't, you should track down someone to give you a hand with this database design. You won't be able to quickly read Louis's book on the topic and turn around even a rough working design in any reasonable period of time.
I have to second what Grant said, this is not a relational design at all.
Stop and ask yourself for example what happens if Steven Arrow has to take an afternoon off and update his schedule. You need to be very careful updating the database lest you reassign all his patients.
Spending a total of 5 minutes on this, I see at the very least:
A Doctors table, a Patients table, and probably a table of open appointment times (which, by the way, is a bit harder than you think, so give some thought to how to handle that and do some reading up on tables for scheduling).
That's for starters. I might break out patients' phone numbers into their own table. Why? Well, how many columns do you want to have for phone numbers? One? What if they have a work AND a home number? Or a work, a cell and a home number? And more.
The concept you're looking for is normal forms. You don't need to go overboard, but generally 3NF is about right.

Approaches to building an efficient database

I was wondering if I could draw on the experience of seasoned database designers. I'm not very experienced with building databases, and I have a project to complete within a set time (a couple of months). The requirement is to build it to third normal form.
I was just wondering what the best way to start would be. Should I go ahead and build something that works (trying to be as efficient as possible) and then go back and refactor any parts that can be improved, or is there some methodology to follow to ensure a certain degree of normalisation from the start?
Do not build a database with the intent to go back later and normalize it. Databases do not refactor as easily as application code, for one thing; for another, you already have a requirement for third normal form, so follow it from the start - it will be far less work and, in the end, a far better product. There is no shortcut to correct database design in third normal form. Just get a clear understanding of what it is before starting to design.
If you are unfamiliar with database design in third normal form, go spend a couple of days doing some reading on database design. Then spend a day or so reading about performance tuning a database - designing a database from the start to perform well will save you a lot of time, and it doesn't take longer than designing a poorly performing database once you know what to do and what to avoid. So my first advice is to get a couple of big fat books and read in depth before you start. The time you spend doing this will be well repaid in your ability to do things better through the rest of the project.
You will certainly write many layers (DAL, BAL, UI) on top of this database layer. Every change at a lower level will bubble upwards and cause more re-work than just at the database layer.
I think you should do all your design and get the database to the best normalization required/needed to save your data without anomalies at the outset.
Unlike front-end programming, refactoring a database structure, once it is in use, is not only painful but impractical.
I strongly recommend getting the structure of the back end right up front. The good news is that third normal form is not generally difficult to achieve. In fact, once you get a little practice, it becomes almost second nature to think of data in those terms.
Third normal form essentially requires that each non-key attribute of a record depend on that record's key, and on nothing else. Consider:
tblPerson:
PersonID PK
LastName
FirstName
MiddleName
DOB
SSN
IncomeSource
IncomeAmount
The table as defined above is NOT in third normal form, since income source is not unique to each occurrence of Person in the table, and in fact is subject to change over the life of that person. Instead, one would structure the above (in overly simplistic terms) as follows:
tblPerson
PersonID PK
LastName
FirstName
MiddleName
DOB
SSN
tblPersonIncome
PersonIncomeID PK
PersonID FK tblPerson
IncomeSourceID FK tblIncomeSource
IncomeAmount
AsOfDate
tblIncomeSource
IncomeSourceID
IncomeSource
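In runnable SQL, that sketch might look like this (SQL Server-flavoured; types and sizes are assumptions on my part):

CREATE TABLE tblPerson (
    PersonID   INT IDENTITY PRIMARY KEY,   -- auto-incrementing surrogate key
    LastName   VARCHAR(50) NOT NULL,
    FirstName  VARCHAR(50) NOT NULL,
    MiddleName VARCHAR(50),
    DOB        DATE,
    SSN        CHAR(11)
);

CREATE TABLE tblIncomeSource (
    IncomeSourceID INT IDENTITY PRIMARY KEY,
    IncomeSource   VARCHAR(100) NOT NULL
);

CREATE TABLE tblPersonIncome (
    PersonIncomeID INT IDENTITY PRIMARY KEY,
    PersonID       INT NOT NULL REFERENCES tblPerson (PersonID),
    IncomeSourceID INT NOT NULL REFERENCES tblIncomeSource (IncomeSourceID),
    IncomeAmount   DECIMAL(12, 2),
    AsOfDate       DATE NOT NULL
);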
Note that some will argue about my use of auto-incrementing ID fields above - while these do not violate third normal form, they also do not utilize a natural key.
If your back end is properly normalized from the jump, adding or extending your entities is much easier to do in the future, and additions will have a much smaller impact on any front end you created before them.
Anyway, design your back end correctly up front, or the misery will be intense once you get rolling and need to implement structural changes later. To do otherwise would be a little like pouring the foundation for your new house while planning to change the floorplan to suit your needs later, as you build it.

Relational database data explorer / visualization?

Is there a tool that can let one browse relational data as a graph of connected nodes?
For example, I'm faced with trying to cleanse some anomalous data. I can start with two offending rows. In this particular example, the TransactionID should, by business rules, be unique to the table, but I find a transaction that violates that rule:
SELECT * FROM LCTTrans
WHERE TransactionID = 1075048
LCTID TransactionID
========= =============
4358 1075048
4359 1075048
2 row(s) affected
But really what I want is to begin hunting down all the related data, to try to see which row is right. So this hypothetical software would start by showing me these two rows:
Next, I want to see the transaction that is linked into this table:
Now that transaction points to an MAL, so show me that:
Now let's add those two LCTs that the transaction is "on". A transaction can be on only one LCT, yet this one is pointing to two:
Okay computer, both of those LCTs point to an MAL and the transaction that created them, show me those:
Those last two transactions, they also point at an MAL, and they themselves point to an LCT, show me those:
Okay, now are there any entries in LCTTrans that point to LCTs 4358 or 4359?...
And so on, and so on.
Now, I did all this manually: running single selects, copying and pasting uniqueidentifier keys and converting them into friendly id numbers so I could easily see the relationships.
Is there software that can do this?
Ok, well I liked this idea so much that I've written it.
It's not released yet, but when it is it will be free.
Edit
Ok, it's now released. Free relational database exploring goodness at http://www.atlantis-interactive.co.uk/products/datasurf/default.aspx
Edit
Although initially free, this is now part of Pragmatic Works' DBA xPress package.
DBeauty is a powerful data browser (similar to Matt Whitfield's excellent DataSurf but more powerful). It is Java based, so you need to download the JDBC driver. I've found this tool invaluable in quickly navigating data (I fell in love with Microsoft's Quadrant before they killed it off and have been looking for a replacement ever since).
The old but good and free DB subsetting tool Jailer should be able to answer the question.
http://jailer.sourceforge.net/
Yes, I would advise you to look into DbSchema; it's a neat database management tool that will help you.
I can think of a few for relational data (RDF, topic map and conceptual graph browsers), but none off-hand for SQL. You could try to translate your queries into a relational language the browsers understand. You also might be able to build something on top of skyrails. Most of the visualisations I've tagged on delicious are for graph or relational data, but again they tend to be schema free rather than SQL.
Basically, you write a dedup tool where you show both records on the screen side by side, with the ability to pick the record you want to keep but also to check individual data from the other record to keep as well. Since deduping is very different from database to database and highly dependent on the specific table structure and business rules you have (as well as knowledge of which things must be looked at for the type of deduping you are doing, since these tools typically only show the most important relationship tables on screen), I have never seen one that wasn't built in-house.
But if you want a quick look at all the data, write a query that left joins to all the child tables and shows all the fields for both transaction ids, then read through your results.
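Something along these lines (the child table and key names here are guesses based on the question, not a real schema):

-- Pull both suspect rows plus whatever hangs off them
SELECT lt.LCTID, lt.TransactionID, t.*, m.*
FROM LCTTrans lt
LEFT JOIN Trans t ON t.TransactionID = lt.TransactionID
LEFT JOIN MAL m ON m.MALID = t.MALID
WHERE lt.TransactionID = 1075048;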
More importantly, how did you end up with a dup if you have a business rule that requires the TransactionID to be unique? Did you forget that all of these types of rules must be enforced through the database and not the application? Why was there no unique index on that field?
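For example (the constraint name is arbitrary):

-- Enforce the business rule in the database itself
ALTER TABLE LCTTrans
    ADD CONSTRAINT UQ_LCTTrans_TransactionID UNIQUE (TransactionID);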
I've looked for open source software that can do this sort of link analysis, without much success. If you have enough of a budget to go proprietary, you might consider talking to Palantir Technologies, Centrifuge Systems, i2, etc. about analytics platforms and visualization technologies.
Try this tool - it is in Russian, but the interface is easy to follow: http://sourceforge.net/projects/basescan/. Navigation in the base is through drag and drop.
