I'm working on the design of a new periodic snapshot fact table. I'm looking at health insurance claims: the amount of money people owe to the insurance company and the amount they've already paid. Data in the table will look like this:
CLAIM_ID TIME_KEY AMOUNT_OWED PAID
123 31.1.2000 1000 0
123 28.2.2000 900 100
123 31.3.2000 800 200
123 30.4.2000 0 1000
123 31.5.2000 0 1000
123 30.6.2000 0 1000
123 31.7.2000 0 1000
123 31.8.2000 0 1000
...
As you can see, after 30.4.2000 it doesn't make sense to insert new data for claim_id 123, as it no longer changes (there is a reasonable degree of certainty it won't change again). Is it a good idea to stop inserting data for this claim, or should I keep inserting rows till the end of time :)?
I'm mainly concerned about sticking to best practices when designing Data Warehouse tables.
Thanks for any answer!
just a few thoughts...
Unless you can have multiple payments in a day against a claim (and potentially other transactions, e.g. interest that increases the amount owed), what you have shown is not really a snapshot fact, it is a transactional fact. The normal example is a bank account, where you have multiple in/out transactions per day and then a snapshot of the end-of-day (or end-of-month) position. Obviously I don't know your business model, but it seems unlikely that there would be multiple transactions per day against a single claim.
If there have been no changes to a claim since the last fact record was created, there seems little point in creating a new fact record.
Typically you choose a periodic snapshot if you have
a) a large number of transactions, and
b) a need for efficient access to the data as of some point in time (end of the month in your case).
If you have, say, 50 claim transactions per month and a claim is active for one year on average, you will profit from this design even if you hold the inactive claims for 50 years (which you probably will not do ;)).
Your doubts suggest that you do not have that many transactions per claim life cycle. In that case you should consider a fact table storing each transaction.
You will then have no overhead at all for inactive claims, but to get snapshot information as of a specific time you'll have to read the whole table.
By contrast, the periodic snapshot is typically partitioned on the snapshot time, so that access is very efficient.
There is no free lunch: you can't have both the space savings and the efficient access.
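To make the trade-off concrete, here is a rough sketch of the point-in-time query against each design (the table and column names are only illustrative, not taken from your model):

-- Transactional fact: one row per payment; a point-in-time position
-- has to be rebuilt by scanning every transaction up to that date.
SELECT claim_id,
       SUM(amount) AS paid_to_date
FROM   claim_payment_fact               -- hypothetical transactional fact
WHERE  payment_date <= DATE '2000-04-30'
GROUP  BY claim_id;

-- Periodic snapshot: the position is pre-computed per period, so the
-- query only touches the partition for one snapshot date.
SELECT claim_id, amount_owed, paid
FROM   claim_monthly_snapshot           -- hypothetical snapshot fact
WHERE  time_key = DATE '2000-04-30';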
I'm maintaining a system where users create something called "books" that are accessed by other users.
I need a convenient (well-performing) way to store events in the database when users visit these books, to later display graphs with statistics. The graphs need to show a history so that the owner of the book can see on which days of the week, and at which times of day, there is more visiting activity (across the months).
Using ERD (Entity-Relationship-Diagram), I can produce the following Conceptual Model:
At first the problem seems to be solved, as we have a very simple situation here. This will give me a table with 3 fields. One will be the occurrence of the visit event, and the other 2 will be foreign keys. One represents the user, while the other represents which book was visited. In short, every record in this table will be a visit:
However, considering that a user averages about 10 to 30 book visits per day, and that the system has 100,000 users, in a single day this table could gain many gigabytes of new records. I'm not the most experienced person in good database performance practices, but I'm pretty sure that this is not the solution.
Even if I do a cleanup of the database to delete old records, I need to keep a history of at least the last 2 months of visits.
I've been looking for a way to solve this for days, and I have not found anything yet. Could someone help me, please?
Thank you.
OBS: I'm using PostgreSQL 9.X, and the system is written in Java.
As mentioned in the comments, you might be overestimating data size. Let's do the math. 100k users at 30 books/day at, say, 30 bytes per record.
(100_000 * 30 * 30) / 1_000_000 # => 90 megabytes per day
Even if you add index size and some amount of overhead, this is still a few orders of magnitude lower than "many gigabytes per day".
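To put the simple design in concrete terms, here is a minimal PostgreSQL sketch (the table, column and index names are only assumptions) of an event table plus a query for the day-of-week / hour-of-day graphs:

-- One row per visit; hypothetical names throughout.
CREATE TABLE book_visit (
    book_id    integer     NOT NULL,
    user_id    integer     NOT NULL,
    visited_at timestamptz NOT NULL DEFAULT now()
);

-- Supports "visits to book X in the last two months" efficiently.
CREATE INDEX book_visit_book_time_idx ON book_visit (book_id, visited_at);

-- Day-of-week / hour-of-day activity for one book over the last two months.
SELECT extract(dow  FROM visited_at) AS day_of_week,
       extract(hour FROM visited_at) AS hour_of_day,
       count(*)                      AS visits
FROM   book_visit
WHERE  book_id = 42
  AND  visited_at >= now() - interval '2 months'
GROUP  BY 1, 2
ORDER  BY 1, 2;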
I have created four TM1 cubes: Rate for hour, Hours, Rate of exchange and Revenue.
In the first one, users enter rates (costs) in different currencies.
In the second one, users enter customer hours (for example, how much time a customer consultation took).
In the third, users enter the rate of exchange for every currency.
In Revenue, based on the data in the previous cubes, I calculate all revenue in Euros.
The problem is when a user enters the same rate in more than one currency; the revenue in the Revenue cube is then bigger than it should be.
My question: is there a way to prevent users from entering rates in more than one currency? All approaches I have tried end up with a circular reference error.
Your question is almost impossible to answer in specific terms because you've provided no specific details of your cubes, dimensions, elements or rules.
In general terms, however... TM1 is not a relational database and other than picklists has few input restrictions. There are usually at least a couple of ways that you can work around that, though. In this case I assume (again, in the absence of specifics) that the relevant dimension in the first cube has an input element for each currency.
Instead of that you could have two input elements; one for the amount, and another for the currency code (regulated by picklist). Your rule in the Revenue cube then evaluates the relevant currency element by looking at the currency code input. That will allow it to look up the relevant exchange rate from the third cube via a DB() function. That rate is multiplied by the work rate that has been entered into the first cube and the hours entered into the second cube to calculate the revenue.
I have this situation. The logic is that a customer is given a credit sale and repays the money in installments. I need to store details about the products, quantities, and the amounts they pay in installments.
In a dashboard I need to show all customers with their name, total sale amount, paid amount, and balance amount.
The approach I thought of:
tblCredit stores a row every time they pay an amount, e.g.
shan (Name), paper (Product), 1500 (Qty), 2000 (Price), 100 (Debit) -- the initial purchase
shan (Name), -, -, -, 200 (Debit)
In a query, filtering by name and computing SUM(Price) - SUM(Debit) gives the balance.
But once the data grows, will this aggregation become troublesome?
Is it possible to cache the aggregated result with a timestamp or something like that, update it on every insert into that table, and show the result from the cache?
Note
The data growth rate will be high.
I am very new to database design.
Please guide me to the best approach to handle this.
Update
Apart from the dashboard, I need to show a report when a user clicks Report, to know how much credit was given to whom. So in any case I need an optimized query, logic, and design to handle this.
Usually a dashboard does not need to get the data in real time. You may think of using a data snapshot (a scheduled insert of your aggregated data) rather than maintaining a summary table updated by different types of sales transactions, which is difficult to keep consistent, especially when handling back-dated processing.
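For example, a rough sketch of such a scheduled snapshot (the summary table and its columns are only assumptions based on your tblCredit description), which the dashboard then reads instead of aggregating the detail rows on every page load:

-- Hypothetical summary table, refreshed on a schedule (e.g. nightly).
CREATE TABLE customer_credit_summary (
    customer_name varchar(100) NOT NULL,
    total_sale    numeric(19,2),
    total_paid    numeric(19,2),
    balance       numeric(19,2),
    snapshot_at   timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Scheduled job: aggregate the detail rows once and store the result.
INSERT INTO customer_credit_summary (customer_name, total_sale, total_paid, balance)
SELECT name,
       SUM(price)              AS total_sale,
       SUM(debit)              AS total_paid,
       SUM(price) - SUM(debit) AS balance
FROM   tblCredit
GROUP  BY name;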
How do I design the database to calculate the account balance?
1) Currently I calculate the account balance from the transaction table
In my transaction table I have "description" and "amount" etc..
I would then add up all "amount" values and that would work out the user's account balance.
I showed this to my friend and he said that is not a good solution: when my database grows it is going to slow down. He said I should create a separate table to store the calculated account balance. If I did this, I would have to maintain two tables, and it's risky; the account balance table could go out of sync.
Any suggestion?
EDIT: Option 2: should I add an extra "Balance" column to my transaction table?
Then I would not need to go through many rows of data to perform my calculation.
Example
John buys $100 of credit, he is debited $60, he then adds $200 of credit.
Amount $100, Balance $100.
Amount -$60, Balance $40.
Amount $200, Balance $240.
An age-old problem that has never been elegantly resolved.
All the banking packages I've worked with store the balance with the account entity. Calculating it on the fly from movement history is unthinkable.
The right way is:
- The movement table has an 'opening balance' transaction for each and every account. You'll need this in a few years' time when you need to move old movements out of the active movement table to a history table.
- The account entity has a balance field.
- There is a trigger on the movement table which updates the account balances for the credited and debited accounts (a minimal sketch follows after this list). Obviously, it has commitment control. If you can't have a trigger, then there needs to be a unique module which writes movements under commitment control.
- You have a 'safety net' program you can run offline, which re-calculates all the balances and displays (and optionally corrects) erroneous balances. This is very useful for testing.
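A minimal PostgreSQL-flavoured sketch of the trigger idea above, assuming a much-simplified pair of tables (real banking schemas are of course far richer):

-- Hypothetical tables: one stored balance per account, one row per movement.
CREATE TABLE account (
    account_id integer       PRIMARY KEY,
    balance    numeric(19,2) NOT NULL DEFAULT 0
);

CREATE TABLE movement (
    movement_id    bigserial     PRIMARY KEY,
    debit_account  integer       NOT NULL REFERENCES account (account_id),
    credit_account integer       NOT NULL REFERENCES account (account_id),
    amount         numeric(19,2) NOT NULL,
    created_at     timestamptz   NOT NULL DEFAULT now()
);

-- Adjust both balances in the same transaction as the insert, so the
-- stored balances cannot drift from the movement history.
CREATE FUNCTION apply_movement() RETURNS trigger AS $$
BEGIN
    UPDATE account SET balance = balance - NEW.amount
     WHERE account_id = NEW.debit_account;
    UPDATE account SET balance = balance + NEW.amount
     WHERE account_id = NEW.credit_account;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER movement_apply
AFTER INSERT ON movement
FOR EACH ROW EXECUTE PROCEDURE apply_movement();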
Some systems store all movements as positive numbers and express the credit/debit by inverting the from/to fields or with a flag. Personally, I prefer a credit field, a debit field and a signed amount; this makes reversals much easier to follow.
Notice that these methods apply both to cash and to securities.
Securities transactions can be much trickier, especially for corporate actions, you will need to accommodate a single transaction that updates one or more buyer and seller cash balances, their security position balances and possibly the broker/depository.
You should store the current account balance and keep it up to date at all times. The transaction table is just a record of what has happened in the past and shouldn't be used at a high frequency just to fetch the current balance. Consider that many queries don't just want balances, they want to filter, sort and group by them, etc. The performance penalty of summing every transaction you've ever created in the middle of complex queries would cripple even a database of modest size.
All updates to this pair of tables should be in a transaction and should ensure that either everything remains in sync (and the account never overdraws past its limit) or the transaction rolls back. As an extra measure, you could run audit queries that check this periodically.
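As a sketch of such an audit query (assuming an account table with a stored balance and a transaction table with one signed amount per row; the names are illustrative only):

-- Flag every account whose stored balance no longer matches the
-- balance recomputed from its transactions.
SELECT a.account_id,
       a.balance                  AS stored_balance,
       COALESCE(SUM(t.amount), 0) AS recomputed_balance
FROM   account a
LEFT   JOIN account_transaction t ON t.account_id = a.account_id
GROUP  BY a.account_id, a.balance
HAVING a.balance <> COALESCE(SUM(t.amount), 0);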
This is a database design I arrived at with only one table, just storing a history of operations/transactions. It is currently working like a charm on many small projects.
This doesn't replace a specific design. This is a generic solution that could fit most of the apps.
id: int - standard row id
operation_type: int - operation type: pay, collect, interest, etc.
source_type: int - from where the operation proceeds; the source table or category: user, bank, provider, etc.
source_id: int - id of the source row in the database
target_type: int - to what the operation is applied; the target table or category: user, bank, provider, etc.
target_id: int - id of the target row in the database
amount: decimal(19,2) signed - positive or negative value to be summed
account_balance: decimal(19,2) signed - resulting balance
extra_value_a: decimal(19,2) signed - an additional number you can store: an interest percentage, a discount, a reduction, etc. (this was the most versatile option without resorting to string storage)
created_at: timestamp
For source_type and target_type it would be better to use an enum or separate lookup tables.
If you want a particular balance you can just query the latest operation, sorted by created_at descending and limited to 1. You can filter by source, target, operation_type, etc.
For better performance it's recommended to store the current balance in the required target object.
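A minimal SQL sketch of this single-table design and the "latest balance" query (the table name and the example type codes are just placeholders):

CREATE TABLE operation (
    id              serial        PRIMARY KEY,
    operation_type  integer       NOT NULL,   -- pay, collect, interest, ...
    source_type     integer       NOT NULL,   -- user, bank, provider, ...
    source_id       integer       NOT NULL,
    target_type     integer       NOT NULL,
    target_id       integer       NOT NULL,
    amount          numeric(19,2) NOT NULL,   -- signed value to be summed
    account_balance numeric(19,2) NOT NULL,   -- resulting balance after this row
    extra_value_a   numeric(19,2),
    created_at      timestamp     NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Current balance of one target: take the most recent operation's balance.
SELECT account_balance
FROM   operation
WHERE  target_type = 1            -- e.g. user
  AND  target_id   = 42
ORDER  BY created_at DESC
LIMIT  1;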
Of course you need to store your current balance with each row, otherwise it is too slow. To simplify development, you can use constraints, so that you don't need triggers and periodic checks of data integrity. I described it here: Denormalizing to enforce business rules: Running Totals
A common solution to this problem is to maintain a (say) monthly opening balance in a snapshot schema. Calculating the current balance can be done by adding transactional data for the month to the monthly opening balance. This approach is often taken in accounts packages, particularly where you might have currency conversion and revaluations.
If you have problems with data volume you can archive off the older balances.
Also, the balances can be useful for reporting if you don't have a dedicated external data warehouse or a management reporting facility on the system.
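As a PostgreSQL-flavoured sketch (with hypothetical table and column names) of deriving the current balance from the monthly opening balance plus the month-to-date transactions:

-- Current balance = this month's opening balance + transactions since month start.
SELECT s.account_id,
       s.opening_balance + COALESCE(SUM(t.amount), 0) AS current_balance
FROM   monthly_balance_snapshot s            -- hypothetical snapshot table
LEFT   JOIN account_transaction t            -- hypothetical transaction table
       ON  t.account_id = s.account_id
       AND t.posted_at >= s.snapshot_month   -- month-to-date rows only
WHERE  s.snapshot_month = date_trunc('month', current_date)
GROUP  BY s.account_id, s.opening_balance;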
Your friend is wrong and you are right, and I would advise you don't change things now.
If your db ever goes slow because of this, and after you have verified all the rest (proper indexing), some denormalisation may be of use.
You could then put a BalanceAtStartOfYear field in the Accounts table and summarize only this year's records (or any similar approach).
But I would certainly not recommend this approach upfront.
Here I would like to suggest how you can store your opening balance in a very simple way:
Create a trigger function on the transaction table that is called after every update or insert.
Create a column named Opening Balance in the account master table.
Save your opening balance as an array in that Opening Balance column of the master table.
You do not even need a server-side language to build this array; you can simply use the database array functions available in PostgreSQL.
When you want to recalculate the opening balance array, just group the transaction table with an array aggregate function and update the data in the master table.
I have done this in PostgreSQL and it works fine.
Over time, when your transaction table becomes heavy, you can partition it by date to speed up performance.
This approach is very easy and does not need any extra table; fewer tables in a join gives you higher performance.
My approach is to store the debits in a debit column and the credits in a credit column, and when fetching the data create two arrays, a debit array and a credit array. Then keep appending the selected data to the arrays and do this in Python:
def real_insert(arr, index, value):
    # Overwrite the element at `index` if it exists, otherwise insert it there.
    try:
        arr[index] = value
    except IndexError:
        arr.insert(index, value)


def add_array(args=(), index=0):
    # Sum the first `index` elements, or all of them when index is 0.
    total = 0
    if index:
        for a in args[:index]:
            total += a
    else:
        for a in args:
            total += a
    return total
then
for n in range(len(array)):
    self.store.clear()
    self.store.append([str(array[n][4])])
    # Keep the row id, debit and credit columns in parallel arrays.
    real_insert(self.row_id, n, array[n][0])
    real_insert(self.debit_array, n, array[n][7])
    real_insert(self.credit_array, n, array[n][8])
    # Debit-normal accounts (Assets, Expenses) vs credit-normal accounts.
    if self.category in ["Assets", "Expenses"]:
        balance = add_array(self.debit_array) - add_array(self.credit_array)
    else:
        balance = add_array(self.credit_array) - add_array(self.debit_array)
Simple answer: Do all three.
Store the current balance; and in each transaction store the movement and a snapshot of the current balance at that point in time. This would give something extra to reconcile in any audit.
I've never worked on core banking systems, but I have worked on investment management systems, and in my experience this is how it's done.