An issue due to a miscalculation of the monthly revenue - database

We have a data warehouse that contains dimension tables and a sales fact table. One day there was an issue: the monthly revenue was miscalculated. We found that the root cause was a missing customer in the source file. How can we improve the system to handle this issue so that the total monthly revenue is still correct? Please state your assumptions about the existing system and your solution for handling the missing customer data.

The missing revenue should be associated with the "Unknown" customer so that it still surfaces in the model and can be traced back to the source for fixing. The data warehouse should NEVER FIX BAD DATA, just reveal it.
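For illustration, a minimal sketch of that "Unknown" member pattern in the load, assuming hypothetical stg_sales, dim_customer and fact_sales tables and an Unknown row pre-seeded with customer_key = -1:

-- Load facts even when the customer is missing from the dimension:
-- unmatched rows fall back to the pre-seeded Unknown member (-1),
-- so the monthly revenue total stays correct and the gap stays visible.
INSERT INTO fact_sales (customer_key, sale_date, revenue)
SELECT COALESCE(dc.customer_key, -1) AS customer_key,
       s.sale_date,
       s.revenue
FROM stg_sales s
LEFT JOIN dim_customer dc
       ON dc.customer_id = s.customer_id;

A simple report filtered on customer_key = -1 then shows exactly how much revenue is waiting for the source file to be corrected.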

Related

How can I find my Snowflake bill within Snowflake?

Usually I get my Snowflake invoices through email, but I'd like to track my consumption within Snowflake.
I've found my usage data in the console, but it isn't mapped to actual consumption in dollars.
Any ideas?
Check the new org tables REMAINING_BALANCE_DAILY and USAGE_IN_CURRENCY_DAILY:
https://docs.snowflake.com/en/sql-reference/organization-usage/usage_in_currency_daily.html
https://docs.snowflake.com/en/sql-reference/organization-usage/remaining_balance_daily.html
Some notes:
The contract items view should show the consumption-related products invoiced for, and the usage_in_currency view shows all the information in the monthly usage statement.
The daily usage numbers in org_usage may not be finalized. These numbers can be refreshed for the past several days, especially storage usage.
Once a month closes, the data should never change and should tie exactly to the usage statements.
Also check the views RATE_SHEET_DAILY and CONTRACT_ITEMS: https://docs.snowflake.com/en/sql-reference/organization-usage.html#organization-usage-views
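As a hedged example of pulling the dollar figures (column names taken from the USAGE_IN_CURRENCY_DAILY documentation linked above; adjust to your organization):

-- Month-to-date spend by usage type, in the contract currency
SELECT usage_date,
       usage_type,
       SUM(usage_in_currency) AS spend
FROM snowflake.organization_usage.usage_in_currency_daily
WHERE usage_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY usage_date, usage_type
ORDER BY usage_date, usage_type;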

Data inaccuracy

At my new job we have an old program that is, I think, 10 to 15 years old but is still in use. I am working on renewing the system, which has a major problem in the old data.
The program is part of a payment system. It allows the original payment to be split at the paying time.
When it splits a payment, it updates the original record and keeps the original date on the new record. It keeps the value before the last operation in a separate "original" field.
$1000 original split into two $500 payments --> by adding a new $500 record and updating the original into a $500 payment, keeping $1000 as the original.
$500 split into $300 and $200 --> by adding a new $200 record and updating the original row into a $300 payment; now the original field holds $500 instead of $1000.
and so on.
The following image contains an example based on a real case, with two original payments of $1000 and $600.
Whoever made the program did not use transactions, so sometimes the new record is not added (that's how the problem was discovered, but too late: 15 years too late).
How can I find the affected customers among 4.5 million records?
Is there a way to recover the real original amount from the original field in the image? (I know that the answer might be no.)
The database is Oracle and the program was developed in Oracle Forms.
Thank you.
Edit: a step-by-step example in a spreadsheet:
https://docs.google.com/spreadsheets/d/1I9jOlCeiVuGdNlgXpiF_-Ic0e-cqaalrpUCJIUM5oAk/edit?usp=sharing
The problem is that the date field stores only the date, not the time. If the customer made several transactions on the same day, the error becomes hard to detect. It can only be detected if the customer made a single transaction that day, and even then it has to be reviewed case by case. That is hard for years' worth of data. Unfortunately, they may have to suffer a loss for bad programming.
I will provide all the table fields tomorrow for better understanding.
Thank you for the replies.
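Without the real table layout this can only be a sketch, but assuming a hypothetical PAYMENTS table with CUSTOMER_ID, PAYMENT_DATE, AMOUNT and ORIGINAL_AMOUNT (the value before the last split), one partial heuristic is to flag day-groups whose pieces no longer add up to at least the last recorded pre-split value:

-- Split rows keep the original date, so rows of one split chain share
-- CUSTOMER_ID and PAYMENT_DATE. If an insert was lost, the remaining pieces
-- can sum to less than the pre-split value still stored in the updated row.
-- This catches only some cases (see the note above about same-day chains).
SELECT customer_id,
       payment_date,
       SUM(amount)          AS pieces_total,
       MAX(original_amount) AS last_recorded_original
FROM payments
GROUP BY customer_id, payment_date
HAVING SUM(amount) < MAX(original_amount);

Anything the query flags would still need the case-by-case review described in the edit.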

Data modeling with the goal of getting the best performance in Oracle and SQL Server

I have a question about how I can model my stock database in order to get the best performance possible.
In SQL Server or Oracle, each executed update takes a lock.
I'd like to know which of the following solutions you think is best.
Solution 1: create a product stock table with a quantity column and, for each input or output, execute a SQL UPDATE against this column.
Solution 2: create a product stock movement table where, for each input, I would execute an INSERT with a positive quantity and, for each output, an INSERT with a negative quantity.
At the end of the day, I would run a process that updates the stock quantity of each product with the SUM from the product stock movement table.
After that, I would delete all records in the product stock movement table.
With Solution 1, I would have the advantage of executing a simple SELECT to get the product stock quantity, but during the day I would have the disadvantage of many locks caused by the many quantity updates from sold products.
With Solution 2, I would have the disadvantage that, whenever I need the product stock quantity, I would have to join to the product stock movement table and sum all inputs and outputs for the queried product; but this way I wouldn't have any locks during the day.
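For concreteness, here is a minimal sketch of Solution 2 with assumed table and column names (SQL Server-flavoured syntax):

-- One row per stock input (positive) or output (negative)
CREATE TABLE product_stock_movement (
    movement_id   INT IDENTITY PRIMARY KEY,
    product_id    INT NOT NULL,
    quantity      INT NOT NULL,            -- +N for input, -N for output
    movement_date DATETIME NOT NULL DEFAULT GETDATE()
);

-- Current quantity on demand: base quantity plus today's movements
SELECT p.product_id,
       p.quantity + COALESCE(SUM(m.quantity), 0) AS current_quantity
FROM product_stock p
LEFT JOIN product_stock_movement m ON m.product_id = p.product_id
WHERE p.product_id = 42
GROUP BY p.product_id, p.quantity;

-- End-of-day process: fold the movements into product_stock, then clear them
UPDATE p
SET    p.quantity = p.quantity + m.total
FROM   product_stock p
JOIN  (SELECT product_id, SUM(quantity) AS total
       FROM product_stock_movement
       GROUP BY product_id) m ON m.product_id = p.product_id;

DELETE FROM product_stock_movement;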
What do you think about the two solutions presented?
Is it a good practice to make the modeling described in solution 2?
Thank you so much
A lot of assumptions are being made here, with a potential solution to a "hypothetical" problem. You don't have numbers to confidently state that either of these designs will lead to a problem. We don't know your hardware specs either.
Do some due diligence first and figure out how much volume you are dealing with on a daily or monthly basis, along with how much reading and writing will happen at any given time (per minute/hour). Once you have these numbers (even if they are not accurate, you will get some sense of the activity), run benchmarks on the actual instance hosting the database (or a comparable one) for both of your solutions and see for yourself which performs better.
Repeat the exercise with 3x or 5x more reads/writes and compare again so you are covered for future growth.
Decisions made on a pile of assumptions lead to very opinionated designs, which always result in poor choices. Always use data to drive your decisions and validate your assumptions.
PS: speaking from experience here. Since we deal with very large transaction volumes, we generally have a summary table and a detail table and use triggers to update the counts in the summary table when new records get inserted into the detail table, as sketched below.
Good luck.
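To make that summary/detail pattern concrete, a sketch with assumed names (SQL Server trigger syntax; Oracle would use a row-level trigger instead):

-- stock_detail holds every movement; stock_summary holds one row per product.
CREATE TRIGGER trg_stock_detail_insert
ON stock_detail
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Roll the newly inserted detail rows up into the per-product summary
    UPDATE s
    SET    s.quantity = s.quantity + i.qty
    FROM   stock_summary s
    JOIN  (SELECT product_id, SUM(quantity) AS qty
           FROM inserted
           GROUP BY product_id) i ON i.product_id = s.product_id;
END;

Note that the trigger moves the locking onto the summary row rather than removing it, which is exactly the kind of trade-off the benchmarks above should measure.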

Large number of entries: how to calculate the total quickly?

I am writing a rather large application that allows people to send text messages and emails. I will charge 7c per SMS and 2c per email sent. I will allow people to "recharge" their account. So, the end result is likely to be a database table with a few small entries like +100 and many, many entries like -0.02 and -0.07.
I need to check a person's balance immediately when they are trying to send an email or a message.
The obvious answer is to have a cached "total" somewhere, and update it whenever something is added or taken out. However, as always in programming, there is more to it: what about monthly statements, where the balance needs to be carried forward from the previous month? My "intuitive" solution is to have two levels of cache: one for the current month, and one entry per month (or billing period) with three values:
The total added
The total taken out
The balance to that point
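A sketch of that two-level scheme, with assumed table and column names, to make the shape concrete:

-- One row per account per billing period; the current period's row is updated
-- as transactions arrive, and older periods are frozen once the statement is issued.
CREATE TABLE account_period_balance (
    account_id      INT            NOT NULL,
    period_start    DATE           NOT NULL,
    total_added     DECIMAL(12, 2) NOT NULL DEFAULT 0,
    total_taken_out DECIMAL(12, 2) NOT NULL DEFAULT 0,
    closing_balance DECIMAL(12, 2) NOT NULL DEFAULT 0,  -- carried into the next period
    PRIMARY KEY (account_id, period_start)
);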
Are there better, established ways to deal with this problem?
Largely depends on the RDBMS.
If it were SQL Server, one solution is to create an Indexed view (or views) to automatically incrementally calculate and hold the aggregated values.
Another solution is to use triggers to aggregate whenever a row is inserted at the finest granularity of detail.
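For the SQL Server indexed-view route, a minimal sketch assuming a hypothetical account_transaction detail table (the amount column must be declared NOT NULL for the SUM to be indexable):

CREATE VIEW dbo.v_account_balance
WITH SCHEMABINDING
AS
SELECT account_id,
       SUM(amount)  AS balance,
       COUNT_BIG(*) AS row_count      -- required when an indexed view uses GROUP BY
FROM dbo.account_transaction
GROUP BY account_id;
GO

-- Materialises the view; SQL Server then maintains the balance incrementally
CREATE UNIQUE CLUSTERED INDEX ix_v_account_balance
ON dbo.v_account_balance (account_id);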

Deriving and saving historical values into a separate table, or calculating the historical values from the existing data only when they're needed?

tl;dr general question about handling database data and design:
Is it ever acceptable (and are there any downsides) to derive data from other data at some point in time and then store that derived data in a separate table in order to keep a history of the values at that time? Or should you never store data that is derived from other data, and instead derive the required data from the existing data only when you need it?
My specific scenario:
We have a database where we record peoples' vacation days and vacation day statuses. We track how many days they have left, how many days they've taken, and things like that.
One design requirement has changed and now asks that I be able to show how many days a person had left on December 31st of any given year. So I need to be able to say, "Bob had 14 days left on December 31st, 2010".
We could do this two ways I see:
1. A SQL Server Agent job that, on December 31st, captures the days remaining for everyone at that time and inserts them into a table like "YearEndHistories", which would have EmployeeID, Year, and DaysRemaining at that time.
2. We don't keep a YearEndHistories table; instead, if we want to find out how many days someone had at a certain time, we work through all the vacation added and subtracted UP TO that specific time.
I like the feeling of certainty that comes with #1 --- the recorded values would be reviewed by administration, and there would be no arguing or possibility about that number changing. With #2, I like the efficiency --- one less table to maintain, and there's no derived data present in the actual tables. But I have a weird fear about some unseen bug slipping by and peoples' historical value calculation start getting screwed up or something. In 2020 I don't want to deal with, "I ended 2012 with 9.5 days, not 9.0! Where did my half day go?!"
One thing we have decided on is that it will not be possible to modify values in previous years. That means it will never be possible to go back to the previous calendar year and add a vacation day or anything like that. The value at the end of the year is THE value, regardless of whether or not there was a mistake in the past. If a mistake is discovered, it will be balanced out by rewarding or subtracting vacation time in the current year.
Yes, it is acceptable, especially if the calculation is complex or frequently called, or doesn't change very often (eg: A high score table in a game - it's viewed very often, but the content only changes on the increasingly rare occasions when a player does very well).
As a general rule, I would normalise the data as far as possible, then add in derived fields or tables where necessary for performance reasons.
In your situation the calculation seems relatively simple - a sum of employee vacation days granted minus days taken - but that's up to you.
As an aside, I would encourage you to get out of thinking about "loops" where data is concerned - try to think about the data as a whole, as a set. Something like:
SELECT StaffID, SUM(Vacation) AS DaysRemaining
FROM
(
    -- Vacation days granted up to the cut-off date
    SELECT StaffID, SUM(VacationAllocated) AS Vacation
    FROM Allocations
    WHERE AllocationDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID

    UNION ALL  -- keep both partial totals per employee; UNION would collapse duplicates

    -- Vacation days taken up to the cut-off date, as a negative number
    SELECT StaffID, -COUNT(DISTINCT HolidayDate)
    FROM HolidayTaken
    WHERE HolidayDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID
) totals
GROUP BY StaffID
Derived data seems to me like a transitive dependency, which is avoided in normalisation.
That's the general rule.
In your case I would go for #1, which gives you better "auditability" without a performance penalty.
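For completeness, approach #1 could then be a scheduled insert along these lines (YearEndHistories columns are from the question; the Employees table and the fn_DaysRemaining helper are hypothetical placeholders for however the current balance is derived, e.g. the query above):

-- Run by the SQL Server Agent job on December 31st: freeze each
-- employee's remaining days for the closing year.
INSERT INTO YearEndHistories (EmployeeID, [Year], DaysRemaining)
SELECT e.EmployeeID,
       YEAR(GETDATE()),
       dbo.fn_DaysRemaining(e.EmployeeID, GETDATE())   -- hypothetical helper
FROM Employees e;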

Resources