Data Vault model for an insurance use case - data-modeling

I am currently working on an insurance use case and would like to apply a Data Vault 2.0 modeling approach to solve this challenge.
The scenario is as follows:
A contract was initially created on 28 March but becomes effective on 01 April.
In June there was an adjustment and the premium was increased from 100 to 125, effective July.
In November there was another adjustment and the premium was increased from 125 to 150, effective December.
To represent the different effective periods of the contract,
a possible approach could be a satellite table with multiple active rows.
My experience with Data Vault has always been limited to use cases with one active record in the satellite table.
Questions:
This does not feel like a real Data Vault 2.0 modeling approach to me.
How could this use case be simplified and solved with a Data Vault 2.0 modeling approach? What could the data model look like?
Is it even possible to apply an INSERT-only modeling approach to this use case?
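For what it's worth, an insert-only satellite with a business effective date alongside the technical load date can be sketched as follows. All table and column names here are assumptions, and SQLite stands in for the warehouse; the point is that corrections and adjustments only ever insert rows, and the "premium as of a date" is derived at query time:

```python
import sqlite3

# Hypothetical hub + satellite; insert-only, one row per change.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_contract (
    contract_hk TEXT PRIMARY KEY,   -- hash key of the business key
    contract_no TEXT,
    load_date   TEXT
);
CREATE TABLE sat_contract_premium (
    contract_hk    TEXT,
    load_date      TEXT,            -- technical: when we learned it
    effective_from TEXT,            -- business: when it applies
    premium        REAL,
    PRIMARY KEY (contract_hk, load_date)
);
""")
con.execute("INSERT INTO hub_contract VALUES ('HK1', 'C-001', '2021-03-28')")
rows = [
    ('HK1', '2021-03-28', '2021-04-01', 100.0),  # initial contract
    ('HK1', '2021-06-15', '2021-07-01', 125.0),  # June adjustment for July
    ('HK1', '2021-11-10', '2021-12-01', 150.0),  # November adjustment
]
con.executemany("INSERT INTO sat_contract_premium VALUES (?,?,?,?)", rows)

def premium_as_of(day):
    """Latest satellite row whose business effective date covers `day`."""
    cur = con.execute("""
        SELECT premium FROM sat_contract_premium
        WHERE contract_hk = 'HK1' AND effective_from <= ?
        ORDER BY effective_from DESC LIMIT 1""", (day,))
    row = cur.fetchone()
    return row[0] if row else None

print(premium_as_of('2021-08-01'))  # 125.0
print(premium_as_of('2021-12-05'))  # 150.0
```

Having multiple "active" rows in the satellite is then not a problem: activity is a property of the query (which effective period covers the date you ask about), not of the stored rows.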


Microsoft cubes usage boundaries and best practices

So we're thinking about using cubes in our organization.
Situation AS IS:
DWH (Azure MS SQL); query language: SQL
Microsoft Column Storage (not real cubes); query language: DAX (there is MDX support, but it looks poorly implemented and inefficient)
Tableau (BI system, reports); can use SQL and MDX
Known problems:
When we use MDX there is an aggregation problem by date (we have to spell out the year/month/date hierarchy in the query); there is no such problem with DAX.
Microsoft Column Storage calculates running totals inefficiently.
How we want to solve the problem right now:
Use Microsoft Column Storage, materializing the running total, but don't use this kind of "cube" in all reports, only for the few people who really need it
Materialize the running total in the DWH; all Tableau reports use it
In the DWH we keep data at daily granularity (e.g. for a record that changed on 1 November, 5 November, and 15 November, we previously had 3 records in the DWH; now we'll have 15). We need it like this to be able to get the data as of any date really fast (basically we're implementing our own cube this way)
Pros:
No one will need to go in-depth with the DAX and MDX languages
We won't have to refactor anything
Cons:
DWH loads (updates) will take longer than they do now
The DWH will get bigger (a row per day for each record)
We have to maintain the running-total fields manually
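The materialization step itself is simple; what follows is only the shape of the idea in Python (the real implementation would be DWH SQL, e.g. a windowed SUM during the load), with made-up daily figures:

```python
from itertools import accumulate

# Hypothetical daily figures; materialize the running total once at
# load time so reports read a plain column instead of computing it.
daily = [("2020-11-01", 10), ("2020-11-02", 7), ("2020-11-03", 5)]
totals = list(accumulate(v for _, v in daily))
materialized = [(d, v, t) for (d, v), t in zip(daily, totals)]
print(materialized)
# [('2020-11-01', 10, 10), ('2020-11-02', 7, 17), ('2020-11-03', 5, 22)]
```

The cost the "Cons" list mentions is visible here: the third column must be rebuilt whenever any earlier day changes, which is exactly the manual maintenance burden.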
Known alternatives:
Microsoft Power BI - can use DAX and MDX really efficiently
Microsoft Analysis Services cubes (real cubes) - as far as we can tell, MDX is efficient here, unlike in Microsoft Column Storage
Questions:
First: if possible, I would really like your impressions of the technologies you've used, to understand what causes pain when you develop and maintain such a solution, and why.
Second: any criticism of our current approach would be really appreciated: why is it bad?
Third: are cubes dead? I mean, Google doesn't offer cubes of its own; maybe the technology itself is a dead end?
Last: if you have any advice on what we should use, that would be great.
I will try to answer this step by step based on my experience; the question is too large for a single technology or person to cover.
First: if possible, I would really like your impressions of the
technologies you've used, to understand what causes pain when you
develop and maintain such a solution, and why.
Warehousing, cubes, reporting, and querying are moving fast toward distributed technologies that can scale horizontally on relatively cheap hardware, scale up and down on demand, and scale quickly. The size of data is also ever increasing with rising internet bandwidth, globalization, social networking, and other factors. Hadoop and the cloud initially filled the gap for distributed technology that can grow horizontally and scale up and down easily.
A SQL Server box with high compute and high RAM for in-memory data, MDX, and cubes is usually vertical scaling; it is costly and cannot be scaled back down as easily as a horizontally distributed system, even if the SQL Server runs in the cloud.
With those advantages come the complexities of developing a big-data solution: the learning curve and maintenance are again a big challenge for adopters who are not yet familiar with it.
Second: any criticism of our current approach would be really
appreciated: why is it bad?
There is no silver bullet architecture that can solve every issue you face without introducing some issues of its own. Your approach is viable and has its pros and cons given your current organisational structure. I assume your team is familiar with SQL Server, MDX, cubes, and column storage, and has done a feasibility analysis. The only issue I see is that as the data grows, SQL Server demands more compute and RAM, which mostly means upgrading the VM/machine. Vertical scaling is costly, and there is always a limit at some point. Failover/DR on such infrastructure is also more costly.
Third: are cubes dead? I mean, Google doesn't offer cubes of its own;
maybe the technology itself is a dead end?
No technology is dead if you can find support for it; even assembly, C, C++, and COBOL are still going strong for old projects and for cases where they fit better than the alternatives.
Last: if you have any advice on what we should use, that would be
great.
Do POCs (proofs of concept) for at least 3-4 types of solutions/architectures; you will be the best judge of what suits you on cost, skills, and timeframe.
If you are open to cloud-based solutions, I can suggest exploring alternatives such as a data lake with Azure Data Factory for a proof of concept, to see if it can meet your requirements.
I also came across one out-of-the-box solution from Microsoft quite recently that is worth looking at: Azure Synapse Analytics (https://azure.microsoft.com/en-in/services/synapse-analytics/). It has built-in support for data warehousing, querying, AI/BI, streaming, data lake exploration, security, and scale; it supports Spark and various other sources along with Power BI, and provides insights/visual display.

Selecting and aggregating data from multiple databases

I'm looking for some advice, so apologies if this is the wrong section (I will delete and post in DB Administrators if that is more appropriate).
I have an application that manages legal cases. Each law firm has its own database for managing their own cases and other business processes. Therefore there can be potentially hundreds of databases in total.
A citizen may have cases with many law firms. Some of these cases are at various "stages" of their life cycle (see the example below).
I want my application to show the user an aggregation of all their cases in total, by stage and status. So to the user it will look like something from an email client:
Cases (15)
Open (10)
Under Examination (4)
Escalated (6)
Closed (5)
To get these numbers I have to select all their cases from each database for each law firm this user has a case with. In pseudo-code I'll have to:
SELECT c1.Title, c1.Stage
FROM [#DatabaseName].dbo.Cases c1
WHERE c1.CustomerID = #CustomerUserID
It has to do this for EACH database (potentially 50+) and then UNION all the results together to get a total for all Open cases, by Stage, that this citizen has.
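A minimal sketch of that per-database fan-out, with in-memory SQLite databases standing in for the per-firm databases (all names are illustrative):

```python
import sqlite3

# Each law firm's database is queried separately and the partial
# results are combined in the application.
def open_firm_db(cases):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Cases (CustomerID INTEGER, Title TEXT, Stage TEXT)")
    con.executemany("INSERT INTO Cases VALUES (?,?,?)", cases)
    return con

firm_dbs = [
    open_firm_db([(42, "Case A", "Open"), (42, "Case B", "Closed")]),
    open_firm_db([(42, "Case C", "Open"), (7, "Case D", "Open")]),
]

def cases_by_stage(customer_id):
    counts = {}
    for con in firm_dbs:  # one round trip per firm database
        for (stage,) in con.execute(
                "SELECT Stage FROM Cases WHERE CustomerID = ?", (customer_id,)):
            counts[stage] = counts.get(stage, 0) + 1
    return counts

print(cases_by_stage(42))  # {'Open': 2, 'Closed': 1}
```

With 50+ databases the round trips dominate, which is why pushing the counting into each database (SELECT Stage, COUNT(*) ... GROUP BY Stage) and unioning the small per-firm summaries is usually cheaper than unioning raw rows.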
The only other way I can think of to make this simpler is to not have a multi-tenant architecture: just have one Cases table in one database, with a column identifying which law firm each case belongs to. The problem is that this solution won't scale well if the application grows into other areas such as HR or Finance. It would mean storing the entirety of every law firm's data in one database, which would kill my server and resources.
Is there a better way of tackling this?
Check out the Azure Elastic Scale API.
It provides APIs for handling multiple shards via a shard-management database.
You can create a shard for each law firm.

Parent/Child key relationship design with app-engine data store

I am implementing a simple application for expense reporting. The application will use GAE. In my application I have several entities (classes) like Year, Month, Day, Expenses, Account, and so on. The picture is as follows: a user can create an Account, then start to declare expenses with a simple form. The expenses are stored in the GAE datastore. Every Year has Months, every Month has Days, and every Day has declared Expenses. The problem is that I don't know how to arrange these entities in the non-relational database of GAE. I have read several tutorials and articles from the Google Developers website, but I still don't understand the concept of parent/child relationships and entity groups. Can anyone help with some tutorials, videos, articles, or books on how to design the relationships and store your entities in a non-relational database like the GAE datastore? Thanks in advance. I forgot to mention that I would like to use the GAE low-level datastore.
If you are using Java, I would suggest using Objectify. It's just so much easier than JPA, for me at least.
You are paying by the read and the write, so if, for instance, you can fit all of the data for a month in 1 MB, then I would not have a separate entity for Day. In any case, I don't understand your requirements, e.g. why Year has to be an entity rather than just a property that you filter by. I would actually consider just having a Day entity with Year and Month properties to filter by.
http://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Relationships
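This is not the datastore or Objectify API itself, but the suggested shape (one flat entity with plain year and month properties to filter by, instead of separate Year/Month/Day parents) can be illustrated with ordinary Python objects:

```python
from dataclasses import dataclass

# Illustrative only: a flat expense record with filterable properties,
# the shape suggested above, rather than a Year -> Month -> Day tree.
@dataclass
class Expense:
    year: int
    month: int
    day: int
    amount: float

expenses = [
    Expense(2011, 4, 1, 12.50),
    Expense(2011, 4, 2, 3.00),
    Expense(2011, 5, 1, 8.75),
]

# "Filter by property" instead of walking parent/child entities:
april = [e for e in expenses if e.year == 2011 and e.month == 4]
print(sum(e.amount for e in april))  # 15.5
```

In the datastore this corresponds to a single kind with indexed year/month properties, which avoids entity-group contention entirely for this use case.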
In MongoDB you would have "embedded" documents. I don't know if GAE is as evolved as MongoDB; I suspect not. Perhaps you should look at another, better-documented NoSQL database if you are having problems with documentation at this stage. I'd have a look at the MongoDB site anyway; if you have a background in SQL, you can see the mapping in terminology between the two cultures. Of course, any NoSQL database is inherently non-transactional, so when the app develops to track expense payments, there may be some insuperable issues later.

Need help (re)designing a database for a self-storage business

I apologize in advance for the overly-long post, but I wanted to provide as much information as possible.
I've been building web database apps for years, but I'm stuck on a redesign that I've been wanting to do forever and I cannot seem to figure out. I'm hoping for a nudge in the right direction.
The business is a self-storage rental business. Tenants typically sign a contract/rental agreement with no specific end date and usually (but not always) pay each month on roughly the same day as their start date. So if you move in on the 16th, your rent is due on the 16th of every month (ignore those last few days of the month that not all months have). Some folks pay 6 months in advance, etc.
Here are the tables that I know/think I need.
Tenants (basic contact info)
Storage Units (information about each unit, square footage, maybe base monthly rate unless I decide to stick those values in a separate table so that I can easily change the base rate for all units of the same type)
Tenants <-> Storage Units (one customer can rent multiple units, but each unit can have only 0 or 1 tenants, so maybe this doesn't need its own table)
The part that I am stuck on is the best way to handle the billing - no invoices are generated - tenants just need to remember to pay. I want to build this as a web app on a hosted provider so I don't know if I should be thinking about handling this with a nightly db script or from the web app side or if I can design it in a way that nothing special needs to happen each night.
Here's a simple example of what I need.
Joe rents Unit #1 on 1/1/2011 and pays for 1 month of rent ($50/month).
How do I build this so that if Joe doesn't pay for month 2 on 2/1/2011 he appears as past due and owing $50? (Late fees also kick in after being 5 days late, and you are flagged to be locked out of your unit if you don't pay for 10 days.) The employees have some discretion on late fees and lockouts if it's a long-time customer who was just a little late, etc.
I think the most similar real-world design might be a cable television provider, but I'm having trouble putting it all together without making it overly complicated.
I will likely be building this in .NET 4 with MS SQL 2005/2008, but I don't think this is a technology-specific question.
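One way to avoid a nightly script is to derive the status on demand from the payment history. This is only a sketch under the rules stated above (rent due monthly on the move-in day, late fee after 5 days, lockout flag after 10); every name and number here is illustrative, and the short-month edge case is ignored as the question allows:

```python
from datetime import date

RATE, LATE_FEE_DAYS, LOCKOUT_DAYS = 50, 5, 10

def billing_status(move_in, paid_months, today):
    """paid_months: how many monthly payments have been made so far."""
    # Due date of the first unpaid month (naive month arithmetic).
    m = move_in.month - 1 + paid_months
    due = date(move_in.year + m // 12, m % 12 + 1, move_in.day)
    if today < due:
        return {"past_due": False, "owed": 0, "late_fee": False, "lockout": False}
    days_late = (today - due).days
    return {"past_due": True, "owed": RATE,
            "late_fee": days_late >= LATE_FEE_DAYS,
            "lockout": days_late >= LOCKOUT_DAYS}

# Joe rented on 1/1/2011 and paid one month; it is now 2/8/2011.
print(billing_status(date(2011, 1, 1), 1, date(2011, 2, 8)))
# {'past_due': True, 'owed': 50, 'late_fee': True, 'lockout': False}
```

Because the status is computed from (move-in date, payments recorded) at read time, nothing special has to run each night; the employee-discretion part then becomes an override column on the charge rather than part of the calculation.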
If the three tables you listed are "obvious", it's equally clear that you'll need some tables for accounts payable and receivable. If you don't already have something for the accounting side of the business you'll have to add it into your site.
I'd wonder about any business that didn't have some kind of accounting software in place. The best thing, in my opinion, would be to find a way to let your web app tap into that data source in a secure, read-only way.

Why do we need a temporal database?

I was reading about temporal databases, and it seems they have built-in time aspects. I wonder why we would need such a model.
How different is it from a normal RDBMS? Couldn't we have a normal database, i.e. an RDBMS, and, say, use a trigger to associate a timestamp with each transaction that happens? Maybe there would be a performance hit, but I'm still skeptical that temporal databases have a strong case in the market.
Do any of the present databases support such a feature?
Consider your appointment/journal diary - it goes from Jan 1st to Dec 31st. Now we can query the diary for appointments/journal entries on any day. This ordering is called the valid time. However, appointments/entries are not usually inserted in order.
Suppose I would like to know what appointments/entries were in my diary on April 4th. That is, all the records that existed in my diary on April 4th. This is the transaction time.
Given that appointments/entries can be created and deleted, a typical record has a beginning and end valid time that covers the period of the entry, and a beginning and end transaction time that indicates the period during which the entry appeared in the diary.
This arrangement is necessary when the diary may undergo historical revision. Suppose on April 5th I realise that the appointment I had on Feb 14th actually occurred on February 12th i.e. I discover an error in my diary - I can correct the error so that the valid time picture is corrected, but now, my query of what was in the diary on April 4th would be wrong, UNLESS, the transaction times for appointments/entries are also stored. In that case if I query my diary as of April 4th it will show an appointment existed on February 14th but if I query as of April 6th it would show an appointment on February 12th.
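The diary example can be sketched as a tiny bitemporal table (dates are the ones from the example; the representation is illustrative). A correction never deletes anything: it closes the old row's transaction-time range and inserts a new row:

```python
from datetime import date

FOREVER = date.max
diary = [
    # (valid_date, tx_from, tx_to): what the appointment was, and
    # during which period the diary claimed it.
    (date(2011, 2, 14), date(2011, 1, 10), date(2011, 4, 5)),  # original entry
    (date(2011, 2, 12), date(2011, 4, 5), FOREVER),            # April 5th correction
]

def as_of(tx_day):
    """What the diary claimed on `tx_day` (transaction-time query)."""
    return [v for v, t0, t1 in diary if t0 <= tx_day < t1]

print(as_of(date(2011, 4, 4)))  # the Feb 14th appointment
print(as_of(date(2011, 4, 6)))  # the corrected Feb 12th appointment
```

Querying "as of April 4th" returns the erroneous February 14th entry, exactly as the diary stood then, while "as of April 6th" returns the corrected February 12th entry.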
This time travel feature of a temporal database makes it possible to record information about how errors are corrected in a database. This is necessary for a true audit picture of data that records when revisions were made and allows queries relating to how data have been revised over time.
Most business information should be stored in this bitemporal scheme in order to provide a true audit record and to maximise business intelligence, hence the need for support in a relational database. Notice that each data item occupies a (possibly unbounded) rectangle in the two-dimensional time model, which is why people often use a GiST index to implement bitemporal indexing. The problem here is that a GiST index is really designed for geographic data, and the requirements for temporal data are somewhat different.
PostgreSQL 9.0 exclusion constraints should provide new ways of organising temporal data e.g. transaction and valid time PERIODs should not overlap for the same tuple.
A temporal database efficiently stores a time series of data, typically by having some fixed timescale (such as seconds or even milliseconds) and then storing only changes in the measured data. A timestamp in an RDBMS is a discretely stored value for each measurement, which is very inefficient. A temporal database is often used in real-time monitoring applications like SCADA. A well-established system is the PI database from OSISoft (http://www.osisoft.com/).
As I understand it (and over-simplifying enormously), a temporal database records facts about when the data was valid as well as the data itself, and permits you to query on the temporal aspects. You end up dealing with 'valid time' and 'transaction time' tables, or 'bitemporal' tables involving both 'valid time' and 'transaction time' aspects. You should consider reading either of these two books:
Darwen, Date and Lorentzos "Temporal Data and the Relational Model" (out of print),
and (at a radically different extreme) "Developing Time-Oriented Database Applications in SQL", Richard T. Snodgrass, Morgan Kaufmann Publishers, Inc., San Francisco, July, 1999, 504+xxiii pages, ISBN 1-55860-436-7. That is out of print but available as PDF on his web site at cs.arizona.edu (so a Google search makes it pretty easy to find).
Temporal databases are often used in the financial services industry. One reason is that you are rarely (if ever) allowed to delete any data, so ValidFrom - ValidTo type fields on records are used to provide an indication of when a record was correct.
Besides "what new things can I do with it", it might be useful to consider "what old things does it unify?". The temporal database represents a particular generalization of the "normal" SQL database. As such, it may give you a unified solution to problems that previously appeared unrelated. For example:
Web Concurrency When your database has a web UI that lets multiple users perform standard Create/Update/Delete (CRUD) modifications, you have to face the concurrent web changes problem. Basically, you need to check that an incoming data modification is not affecting any records that have changed since that user last saw those records. But if you have a temporal database, it quite possibly already associates something like a "revision ID" with each record (due to the difficulty of making timestamps unique and monotonically ascending). If so, then that becomes the natural, "already built-in" mechanism for preventing the clobbering of other users' data during database updates.
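The "revision ID" check described above amounts to optimistic concurrency control. A minimal sketch (the store, names, and error type are all illustrative):

```python
# A write is rejected unless it names the revision the client last saw.
class Conflict(Exception):
    pass

store = {"record-1": {"rev": 1, "value": "draft"}}

def update(key, expected_rev, new_value):
    rec = store[key]
    if rec["rev"] != expected_rev:          # someone else changed it first
        raise Conflict(f"expected rev {expected_rev}, found {rec['rev']}")
    store[key] = {"rev": rec["rev"] + 1, "value": new_value}

update("record-1", 1, "edited by user A")   # succeeds, rev becomes 2
try:
    update("record-1", 1, "edited by user B")  # stale rev: rejected
except Conflict as e:
    print(e)
```

In a temporal database the monotonically increasing revision identifier already exists for free, so this check costs nothing extra to implement.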
Legal/Tax Records The legal system (including taxes) places rather more emphasis on historical data than most programmers do. Thus, you will often find advice about schemas for invoices and the like that warns you to beware of deleting records or normalizing in a natural way, which can lead to an inability to answer basic legal questions like "Forget their current address, what address did you mail this invoice to in 2001?" With a temporal foundation, all the machinations around those problems (they are usually halfway steps toward having a temporal database) go away. You just use the most natural schema, and delete when it makes sense, knowing that you can always go back and answer historical questions accurately.
On the other hand, the temporal model itself is half-way to complete revision control, which could inspire further applications. For example, suppose you roll your own temporal facility on top of SQL and allow branching, as in revision control systems. Even limited branching could make it easy to offer "sandboxing" -- the ability to play with and modify the database with abandon without causing any visible changes to other users. That makes it easy to supply highly realistic user training on a complex database.
Simple branching with a simple merge facility could also simplify some common workflow problems. For example, a non-profit might have volunteers or low-paid workers doing data entry. Giving each worker their own branch could make it easy to allow a supervisor to review their work or enhance it (e.g., de-duplication) before merging it into the main branch, where it would become visible to "normal" users. Branches could also simplify permissions. If a user is only granted permission to use/see their unique branch, you don't have to worry about preventing every possible unwanted modification; you'll only merge the changes that make sense anyway.
Apart from reading the Wikipedia article? A database that maintains an "audit log" or similar transaction log will have some properties of being "temporal". If you need answers to questions about who did what to whom and when then you've got a good candidate for a temporal database.
You can imagine a simple temporal database that just logs your GPS location every few seconds. The opportunities for compressing this data are great; in a normal database you would need to store a timestamp for every row. If you need a great deal of throughput, knowing that the data is temporal and that updates and deletes to a row will never be required lets the program drop a lot of the complexity inherent in a typical RDBMS.
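A toy version of that compression idea, with assumed numbers and a fixed 5-second tick: store one start time plus per-fix deltas, and reconstruct full (timestamp, lat, lon) rows only when reading:

```python
# Store a start time, a fixed tick, and only the changes in position,
# instead of a full timestamp and coordinate pair for every row.
start, tick = 0, 5                       # epoch seconds, one fix every 5 s
deltas = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (0.1, 0.1)]  # (dlat, dlon)

def expand(origin=(51.5, -0.1)):
    """Rebuild the full (timestamp, lat, lon) series from the deltas."""
    lat, lon = origin
    out = []
    for i, (dlat, dlon) in enumerate(deltas):
        lat, lon = lat + dlat, lon + dlon
        out.append((start + i * tick, round(lat, 3), round(lon, 3)))
    return out

print(expand())
```

No per-row timestamp is ever stored; it is implied by position in the series, which is the efficiency the answer above is pointing at.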
Despite this, temporal data is usually just stored in a normal RDBMS. PostgreSQL, for example, has some temporal extensions, which make this a little easier.
Two reasons come to mind:
Some are optimized for insert- and read-only workloads and can offer dramatic performance improvements
Some have a better understanding of time than traditional SQL, allowing grouping operations by second, minute, hour, etc.
Just an update: temporal tables are coming in SQL Server 2016.
To clear up any doubts about why you would need a temporal database, rather than configuring it with custom methods, and how efficiently and seamlessly SQL Server configures it for you, check the in-depth video and demo on Channel 9 here: https://channel9.msdn.com/Shows/Data-Exposed/Temporal-in-SQL-Server-2016
MSDN link: https://msdn.microsoft.com/en-us/library/dn935015(v=sql.130).aspx
Currently, with the CTP2 (beta 2) release of SQL Server 2016, you can play with it.
Check this video on how to use temporal tables in SQL Server 2016.
My understanding of temporal databases is that they are geared towards storing certain types of temporal information. You could simulate that with a standard RDBMS, but by using a database that supports it you get built-in idioms for a lot of the concepts, and the query language may be optimized for these sorts of queries.
To me this is a little like working with a GIS-specific database rather than an RDBMS. While you could shove coordinates in a run-of-the-mill RDBMS, having the appropriate representations (e.g., via grid files) may be faster, and having SQL primitives for things like topology is useful.
There are academic databases and some commercial ones. Timecenter has some links.
Another example of where a temporal database is useful is where data changes over time. I spent a few years working for an electricity retailer, where we stored meter readings for 30-minute blocks of time. Those meter readings could be revised at any point, but we still needed to be able to look back at the history of changes to the readings.
We therefore had the latest reading (our 'current understanding' of the consumption for the 30 minutes) but could look back at our historic understanding of the consumption. When you've got data that can be adjusted in such a way temporal databases work well.
(Having said that, we hand carved it in SQL, but it was a fair while ago. Wouldn't make that decision these days.)
