Periodic snapshot fact table - Design question - database

I'm working on the design of a new periodic snapshot fact table. It tracks health insurance claims: the amount of money people still owe the insurance company and the amount they've already paid. The data in the table will look like this:
CLAIM_ID  TIME_KEY   AMOUNT_OWED  PAID
123       31.1.2000  1000         0
123       28.2.2000  900          100
123       31.3.2000  800          200
123       30.4.2000  0            1000
123       31.5.2000  0            1000
123       30.6.2000  0            1000
123       31.7.2000  0            1000
123       31.8.2000  0            1000
...
As you can see, after 30.4.2000 it doesn't make sense to insert new data for claim_id 123, as it no longer changes (there is a reasonable degree of certainty that it won't change again). Is it a good idea to stop inserting data for this claim, or should I keep inserting it till the end of time :)?
I'm mainly concerned about sticking to best practices when designing Data Warehouse tables.
Thanks for any answer!

just a few thoughts...
Unless you can have multiple payments in a day against a claim (and potentially other transactions, e.g. interest that increases the amount owed), what you have shown is not really a snapshot fact; it is a transactional fact. The usual example is a bank account where you have multiple in/out transactions per day and then a snapshot of the end-of-day (or end-of-month) position. Obviously I don't know your business model, but it seems unlikely that there would be multiple transactions per day against a single claim.
If there have been no changes to a claim since the last fact record was created, there seems little point in creating a new fact record.
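To make the distinction concrete, the same claim expressed as a transactional fact would be one row per payment event rather than one row per claim per period. A quick sketch (field names are illustrative only, not the asker's schema):

# Illustrative only: claim 123 from the question expressed as a
# transactional fact, i.e. one row per payment event.
payment_facts = [
    {"claim_id": 123, "time_key": "2000-02-28", "payment_amount": 100},
    {"claim_id": 123, "time_key": "2000-03-31", "payment_amount": 100},
    {"claim_id": 123, "time_key": "2000-04-30", "payment_amount": 800},
]

# The running balance can always be derived from the events:
initial_owed = 1000
owed = initial_owed - sum(row["payment_amount"] for row in payment_facts)
print(owed)  # 0 -- the claim is settled; no further rows are ever written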

Typically you choose a periodic snapshot if you have
a) a large number of transactions, and
b) a need for efficient access to the data as of some point in time (end of the month in your case).
If you have, say, 50 claim transactions per month and a claim is active for one year on average, you will benefit from this design even if you hold the inactive claims for 50 years (which you probably will not do ;)
Your doubts suggest that you do not have that many transactions per claim life cycle. In that case you should consider a fact table storing each transaction.
You will then have no overhead at all for inactive claims, but to get the snapshot as of a specific time you'll have to read the whole table.
By contrast, the periodic snapshot is typically partitioned on the snapshot time, so that access is very efficient.
There is no free lunch here: you cannot have both the space savings and the efficient access.
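If you do go with the periodic snapshot, one common compromise is to stop emitting rows for a claim once it is settled. A minimal sketch of that idea (the column layout and the settlement rule are assumptions):

from datetime import date

def monthly_snapshot_rows(claim_id, initial_owed, payments_by_month, month_ends):
    """Yield one (claim_id, time_key, amount_owed, paid) row per month,
    and stop once the claim is settled (amount owed reaches zero)."""
    paid = 0
    for month_end in month_ends:
        paid += payments_by_month.get(month_end, 0)
        owed = initial_owed - paid
        yield (claim_id, month_end, owed, paid)
        if owed == 0:
            break  # settled: no further snapshot rows for this claim

month_ends = [date(2000, 1, 31), date(2000, 2, 28), date(2000, 3, 31),
              date(2000, 4, 30), date(2000, 5, 31), date(2000, 6, 30)]
payments = {date(2000, 2, 28): 100, date(2000, 3, 31): 100, date(2000, 4, 30): 800}

for row in monthly_snapshot_rows(123, 1000, payments, month_ends):
    print(row)
# ends with (123, 2000-04-30, 0, 1000); nothing is written after 30.4.2000

The trade-off, as noted above, is that a point-in-time query then has to pick up the latest row per claim instead of simply reading one snapshot partition.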

Related

Choosing proper database in AWS when all items must be read from the table

I have an AWS application where DynamoDB is used for most data storage and it works well for most cases. I would like to ask you about one particular case where I feel DynamoDB might not be the best option.
There is a simple table with customers. Each customer can collect virtual coins, so each customer has a balance attribute. The balance is managed by a 3rd-party service that keeps the up-to-date value, and the balance attribute in my table is just a cached copy of it. The 3rd-party service requires its own customer id as input, so the customers table also contains an externalId attribute, which is used to query the balance.
I need to run the following process once per day:
Update the balance attribute for all customers in a database.
Find all customers with the balance greater than some specified constant value. They need to be sorted by the balance.
Perform some processing for all of those customers. The processing must be performed in order, starting from the customer with the greatest balance and proceeding in descending order of balance.
Question: which database is the most suitable one for this use case?
My analysis:
In terms of cost it looks quite similar, i.e. paying for capacity/request units in the case of DynamoDB vs paying for hours of micro instances in the case of RDS. I'm not sure, though, whether a micro RDS instance is enough for this purpose. I'm going to check, but I guess it should be enough.
In terms of performance - I'm not sure here. It's something I will need to check but wanted to ask you here beforehand. Some analysis from my side:
It involves two scan operations in the case of DynamoDB, which looks like something I really don't want. The first scan can be limited to the externalId attribute; then the balances are queried from the 3rd-party service and updated in the table. The second scan requires a range key defined on the balance attribute to return customers sorted by balance.
I'm not convinced that any kind of index can help here. Basically, there won't be many read operations on the balance; sometimes it will need to be queried for a single customer using its primary key. The number of reads won't be much greater than the number of writes, so indexes may slow the process down.
Additional assumptions in case they matter:
There are ca. 500,000 customers in the database, and the average size of a single customer record is 200 bytes, so the total size of the customer data is about 100 MB.
I need to repeat step 1 from the above procedure (update the balance of all customers) several times during the day (ca. 20-30 times per day) but the necessity to retrieve sorted data is only once per day.
There is only one application (and one instance of the application) performing the above procedure. Besides that, I need to handle simple CRUD which can read/update other attributes of the customers.
I think people are overly afraid of DynamoDB scan operations. They're bad if used for regular queries but for once-in-a-while bulk operations they're not so bad.
How much does it cost to scan a 100 MB table? That's 25,000 4 KB blocks. With eventually consistent reads that's 12,500 read units. If we assume the cost is $0.25 per million (On Demand mode), that's 12,500 / 1,000,000 * $0.25 = $0.003 per full table scan. Want to do it 30 times per day? It costs you less than a dime a day.
The thing to consider is the cost of updating every item in the database. That's 500,000 write units, which in On Demand mode at $1.25 per million comes to about $0.63 per full table update.
If you can go Provisioned for that duration it'll be cheaper.
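Spelled out, that arithmetic looks like this (the prices are the On Demand figures quoted above; treat them as assumptions and check current AWS pricing):

# On Demand cost estimate for the workload described above. Prices are the
# figures quoted in this answer; verify them against current AWS pricing.
READ_PRICE_PER_MILLION = 0.25    # USD per million read request units
WRITE_PRICE_PER_MILLION = 1.25   # USD per million write request units

blocks_4kb = 25_000              # a 100 MB table read in 4 KB blocks
read_units = blocks_4kb / 2      # eventually consistent reads cost half
scan_cost = read_units / 1_000_000 * READ_PRICE_PER_MILLION
print(f"one full scan:     ${scan_cost:.4f}")        # ~$0.0031
print(f"30 scans per day:  ${30 * scan_cost:.2f}")   # under a dime

write_units = 500_000            # one write unit per item (items are < 1 KB)
update_cost = write_units / 1_000_000 * WRITE_PRICE_PER_MILLION
print(f"full table update: ${update_cost:.3f}")      # ~$0.63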
Regarding performance, DynamoDB can scan a full table faster than any server-oriented database, because it's supported by potentially thousands of back-end servers operating in parallel. For example, you can do a parallel scan with up to a million segments, each with a client thread reading data in 1 MB chunks. If you write a single-threaded client doing a scan it won't be as fast. It's definitely possible to scan slowly, but it's also possible to scan at speeds that seem ludicrous.
If your table is 100 MB, was created in On Demand mode, has never hit a high water mark to auto-increase capacity (just the starter capacity), and you use a multi-threaded pull with 4+ segments, I predict you'll be done in low single digit seconds.
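For what it's worth, a rough boto3 sketch of such a parallel scan (the table name, attribute name and segment count are assumptions, not part of the question):

# Rough sketch of a parallel scan with boto3; not the asker's actual code.
from concurrent.futures import ThreadPoolExecutor

import boto3

TABLE_NAME = "customers"   # hypothetical
TOTAL_SEGMENTS = 8         # 4+ segments; DynamoDB allows up to 1,000,000

def scan_segment(segment):
    """Scan one segment of the table, following pagination."""
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    items, kwargs = [], {
        "Segment": segment,
        "TotalSegments": TOTAL_SEGMENTS,
        "ProjectionExpression": "externalId",  # only what the update step needs
    }
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    all_items = [item for chunk in pool.map(scan_segment, range(TOTAL_SEGMENTS))
                 for item in chunk]
print(len(all_items))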

How to handle cash advance travel expenses in database?

I'm implementing a software solution for a company. As part of its business processes it gives cash advances to its employees when they go on a business trip, let's say 1,000 dollars. Those 1,000 dollars are withdrawn from one of the available cash registers and the transaction is registered in the database as an expense, so at the end of their shifts the cashiers are able to justify the missing 1,000.
Then, sometimes the employee who travels spends just part of the money he was given, let's say 500 on gas, 200 on hotel and 100 on meals, so 800 out of the 1000 he was given.
So here's the situation I'm dealing with: those three expenses (gas, hotel and meals) need also to be registered individually when the employee comes back from his trip for two reasons:
To store the expense under the right expense concept so we can then query the database and see how much the company spends on every concept.
To match the money given to the employee as a cash advance for his travel expenses with his actual expenses. So a total of four transactions would be performed: three expenses and one income (the 200 the employee didn't use), to match the original 1,000.
Up to this point everything is fine regarding the cash advance transaction: 1,000 dollars were given and 1,000 dollars were justified, 800 in expenses and 200 as the money back. The problem is that the original 1,000 still exists as an expense, so the total balance would be -1,600: 1,800 spent and 200 income, which is obviously not correct (only 800 was actually spent).
I've thought of two alternatives:
Subtract the travel expenses (800) from the original cash advance expense (1,000). This way the final balance is right (0). But that would mean modifying an old expense entry, which in turn will affect that day's cash closing if for some reason it needs to be consulted in the future. I'm not sure this is actually a bad thing to do, since the final balance will match the actual money; it just feels wrong to mess with an old entry that will no longer represent what actually happened that day.
Treat cash advances as a special entity, not as an actual expense. This new entity will have the following fields: withdrawn_money, money_back and spent_money. When calculating the cash closing for that day, the outgoing money would be the sum of expenses plus the sum of cash advance withdrawn_money, thus leaving the cash closing intact. Later, when the actual expenses are presented, they will go directly to the expenses table, but at the same time they will be registered as a new cash_advance spent_money entry. This way we can keep track of how much of the withdrawn money is returned and at the same time save the actual expenses under the right expense concepts for future querying.
Not treating cash advances as actual expenses makes sense to me, since the money is in the employee's hands and so technically is not an expense yet.
Alternative 2 sounds better to me but I'm still trying to find the most appropriate implementation. It would be of great help to get opinions of more experienced developers and database designers on subjects like this.
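As a starting point, a minimal sketch of what the alternative 2 entity could look like. The three field names come from the description above; everything else (the employee reference, the types, the derived helper) is an assumption:

# Illustrative only: one way to shape the cash advance entity from
# alternative 2. Field names come from the question; the rest is assumed.
from dataclasses import dataclass

@dataclass
class CashAdvance:
    employee_id: int          # hypothetical: who received the advance
    withdrawn_money: float    # e.g. 1000, counted as outgoing cash in the closing
    spent_money: float = 0.0  # grows as receipts are registered as expenses
    money_back: float = 0.0   # cash the employee returns

    def outstanding(self) -> float:
        """Money still in the employee's hands, not yet justified."""
        return self.withdrawn_money - self.spent_money - self.money_back

advance = CashAdvance(employee_id=42, withdrawn_money=1000)
advance.spent_money = 800   # gas 500 + hotel 200 + meals 100
advance.money_back = 200
assert advance.outstanding() == 0

The cash closing then counts withdrawn_money as outgoing cash, while the expense concepts only ever see the rows in the expenses table.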
Thanks in advance for your time and I would gladly clarify anything if I wasn't clear enough.
If you are familiar with double-entry accounting, this situation looks like this:
when the advance is given, there is a transaction from account 501 (which refers to the money in cash), using the currency as a reference/article, to account 422 (which refers to the company staff), using the employee as a reference/article
when the employee comes back:
if he has paid with credit card only and has not used the cash - you make a reverse transaction (from account 422 to account 501)
otherwise for each receipt/invoice he brings back - you make a transaction from account 422 (using the employee as a reference/article) to account 609 (which is for other expenses - and then referencing the appropriate expense)
if the employee does not have a receipt/invoice for some of the money - it stays in account 422 and on the next salary payment you make a smaller transaction for his salary and compensate with a transaction from account 422 to account 604 (which is the payroll) to clear the remainder
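Played out with the numbers from the question, the postings would look something like this (the account numbers are the ones given above; the journal layout itself is just a sketch):

# The example from the question expressed as double-entry postings, using
# the account numbers given above (501 cash, 422 staff, 609 other expenses).
# The tuple layout (from_account, to_account, amount, reference) is a sketch.
journal = [
    (501, 422, 1000, "advance to employee for trip"),
    (422, 609, 500, "gas"),
    (422, 609, 200, "hotel"),
    (422, 609, 100, "meals"),
    (422, 501, 200, "unused cash returned"),
]

def balance(account):
    """Positive means money currently sitting on that account."""
    incoming = sum(amount for _, to, amount, _ in journal if to == account)
    outgoing = sum(amount for frm, _, amount, _ in journal if frm == account)
    return incoming - outgoing

assert balance(422) == 0      # the employee has fully justified the advance
assert balance(609) == 800    # real expenses recorded under their concepts
assert balance(501) == -800   # net cash out of the register is only 800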

How to store total visits statistics for user history efficiently?

I'm maintaining a system where users create something called "books" that are accessed by other users.
I need a convenient (well-performing) way to store events in the database whenever users visit these books, so that I can later display graphs with statistics. The graphs need to show a history where the owner of the book can see on which days of the week, and at which times, there is more visiting activity (across the months).
Using an ERD (Entity-Relationship Diagram), I can produce the following conceptual model:
At first the problem seems to be solved, as we have a very simple situation here. This gives me a table with 3 fields. One will be the timestamp of the visit event, and the other 2 will be foreign keys: one represents the user, while the other represents which book was visited. In short, every record in this table will be a visit:
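For reference, that three-field table might look roughly like this in PostgreSQL (column names, types and the index are guesses, since the diagram isn't shown; psycopg2 is used purely for illustration):

# A guess at the three-field visit table described above; column names,
# types and the index are assumptions.
import psycopg2

VISITS_DDL = """
CREATE TABLE IF NOT EXISTS book_visit (
    visited_at  timestamptz NOT NULL,                       -- when the visit happened
    user_id     bigint NOT NULL REFERENCES app_user (id),   -- who visited
    book_id     bigint NOT NULL REFERENCES book (id)        -- which book
);
-- the graphs group by book, day of week and hour, so index book + time
CREATE INDEX book_visit_book_time_idx ON book_visit (book_id, visited_at);
"""

conn = psycopg2.connect("dbname=app")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute(VISITS_DDL)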
However, considering that a user averages about 10 to 30 book visits per day, and that the system has 100,000 users, this table could grow by many gigabytes of new records in a single day. I'm not the most experienced person in good database performance practices, but I'm pretty sure this is not the solution.
Even though I do a cleanup on the database to delete old records, I need to keep a record history of the last 2 months of visits (at least).
I've been looking for a way to solve this for days, and I have not found anything yet. Could someone help me, please?
Thank you.
OBS: I'm using PostgreSQL 9.X, and the system is written in Java.
As mentioned in the comments, you might be overestimating data size. Let's do the math. 100k users at 30 books/day at, say, 30 bytes per record.
(100_000 * 30 * 30) / 1_000_000 # => 90 megabytes per day
Even if you add index size and some per-row overhead, this is still an order of magnitude or two lower than "many gigabytes per day".

How to save user state per entity efficiently?

TL;DR: How do I save a state for each user on each entity of my application while avoiding a Cartesian product?
Say I have 10K entities and 10K users, and each user has a status per entity. In order to manage my users it feels like I'm going to have to save an entry per user-entity pair in some Cartesian table, which will have 100 million entries, and that seems irrational...
I've been thinking that this table could be sorted by the user's primary key, so my queries could be more efficient, but it still seems like a bad way to handle this situation.
Any solution would be highly appreciated, whether it's the DB choice itself (relational or not), different ERDs, or anything else.
Thanks!
Firstly: based on your question, every user can have a status for each entity, and all of this information needs to be saved. You have 100 million entries with the same metadata (entityID, userID, statusID), so you do have a Cartesian product between User and Entity here. There is no ERD improvement to be had from data modeling patterns.
Secondly: you have 100 million entries as a MAXIMUM. I think your data is not BIG enough and does not hit the 3 V's of Big Data (Volume: amount of data; Velocity: speed of data in and out; Variety: range of data types and sources). You can handle it with a relational DBMS. However, if you have far more than 100 million entries (for example 100 million entries per day or per week), you should look at Big Data technologies.
Thirdly: to get maximum query performance, you can use an in-memory DBMS. You have 100 million entries with 3 Long (or BigInt) IDs each, so you need at most approximately 100 million * 3 * 8 bytes = 2,400 MB = 2.4 GB of memory. (I'm sure you have a big enough server if you are handling 10K users with 10K entities.)
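The estimate, spelled out:

# The memory estimate from the point above, spelled out.
entries = 100_000_000      # 10K users x 10K entities
ids_per_entry = 3          # entityID, userID, statusID
bytes_per_id = 8           # Long / BigInt
print(entries * ids_per_entry * bytes_per_id / 1_000_000_000)  # 2.4 (GB of raw IDs)
# Row headers and indexes add to this, but the order of magnitude holds.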

Which data store is best for my scenario

I'm working on an application that involves a very high rate of update/select queries against the database.
I have a base table (A) which will have about 500 records for an entity per day. For every user in the system, a variation of this entity is created based on some of the user's preferences, and these variations are stored in another table (B). This is done by a cron job that runs at midnight every day.
So if there are 10,000 users and 500 records in table A, there will be 5M records in table B for that day. I always keep data for one day in these tables and at midnight I archive historical data to HBase. This setup is working fine and I'm having no performance issues so far.
There has been a change in the business requirements lately: some attributes in base table A (for 15-20 records) now change every 20 seconds, and based on that I have to recalculate some values for all of the corresponding variation records in table B for all users. Even though only 20 master records change, I need to recalculate and update 200,000 user records, which takes more than 20 seconds, and by then the next update occurs, eventually resulting in all select queries getting queued up. I'm getting about 3 get requests per 5 seconds from online users, which results in 6-9 select queries. To respond to an API request, I always use the fields in table B.
I can buy more processing power and solve this situation but I'm interested in having a properly scaled system which can handle even a million users.
Can anybody here suggest a better alternative? Would combining NoSQL with a relational database help me here? Are there any platforms/datastores that would let me update data frequently without locking and at the same time give me the flexibility of running select queries on various fields of an entity?
Cheers
Jugs
I recommend looking at an in-memory DBMS that fully implements MVCC, to eliminate blocking issues. If your application currently uses SQL, then there's no reason to move away from it to NoSQL. The performance requirements you describe can certainly be met by an in-memory SQL-capable DBMS.
What I understand from what you are saying is that you are updating 200K records every 20 seconds, so within about 10 minutes you will have updated almost all of your data. In that case, why are you writing that state to the database at all if it is updated so frequently? I don't know the details of your requirements, but why don't you just calculate it on demand using the data from table A?
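Just to illustrate the shape of that idea: instead of rewriting 200,000 rows in table B every 20 seconds, the user-specific value could be derived at read time from the current table A row plus the user's stored preferences. Everything below (table names, columns, the adjustment formula, and the use of psycopg2) is hypothetical, since the actual recalculation isn't described:

# Shape sketch only: derive the per-user value at read time instead of
# materializing table B. Names, columns and the formula are made up.
import psycopg2

ON_DEMAND_QUERY = """
SELECT a.entity_id,
       a.base_value * p.multiplier + p.adjustment AS user_value  -- made-up rule
FROM   table_a a
JOIN   user_preferences p ON p.entity_id = a.entity_id
WHERE  p.user_id = %s
"""

def user_view(conn, user_id):
    with conn.cursor() as cur:
        cur.execute(ON_DEMAND_QUERY, (user_id,))
        return cur.fetchall()

conn = psycopg2.connect("dbname=app")  # hypothetical connection string
rows = user_view(conn, 42)             # ~500 rows computed on the fly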
