Here is my situation: customers are given goods on credit and repay the money in installments. I need to store the details of the products, the quantities and the amounts paid in each installment.
On the dashboard I need to show every customer with their name, total sale amount, paid amount and balance amount.
The approach I thought of:
tblCredit = stores one row for every time the customer pays an amount
e.g. shan (Name), paper (Product), 1500 (Qty), 2000 (Price), 100 (Debit) {initial purchase}
shan (Name), -, -, -, 200 (Debit)
In the query, filtering by name and taking SUM(Price) - SUM(Debit amount) gives the balance.
But once the data grows, will this aggregation become troublesome?
Is it possible to cache the aggregated result with timestamps or something like that, update it on every insert into that table, and show the result from there?
Note
Data growth rate will be high.
I am very new to designing.
Please guide me the best approach to handle this.
Update
Apart from the dashboard, I need to show a report when a user clicks Report, to see how much credit was given to whom. So in any case I need an optimized query, logic and design to handle this.
Usually a dashboard does not need to show data in real time. Consider using a data snapshot (a scheduled insert of the aggregated results) rather than maintaining a summary table that is updated by different types of sales transactions; with the latter it is difficult to maintain integrity, especially when handling back-dated processing.
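A minimal sketch of that snapshot idea, assuming SQLite through Python's sqlite3 module. The table and column names (credit_entry, dashboard_snapshot, sale_amount, paid_amount) and the refresh_snapshot helper are made up for illustration, and amounts are assumed to be whole currency units. A scheduled job would call refresh_snapshot, and the dashboard would read only the snapshot table:

```python
import sqlite3


def create_credit_schema(conn):
    """Create the detail table and the snapshot table (hypothetical names)."""
    conn.executescript(
        """
        CREATE TABLE credit_entry (
            id          INTEGER PRIMARY KEY,
            customer    TEXT NOT NULL,
            product     TEXT,
            qty         INTEGER,
            sale_amount INTEGER NOT NULL DEFAULT 0,
            paid_amount INTEGER NOT NULL DEFAULT 0
        );
        CREATE INDEX idx_credit_entry_customer ON credit_entry (customer);
        CREATE TABLE dashboard_snapshot (
            customer   TEXT PRIMARY KEY,
            total_sale INTEGER NOT NULL,
            total_paid INTEGER NOT NULL,
            balance    INTEGER NOT NULL,
            taken_at   TEXT NOT NULL
        );
        """
    )


def refresh_snapshot(conn):
    """Rebuild the snapshot in one transaction; meant to run on a schedule."""
    with conn:
        conn.execute("DELETE FROM dashboard_snapshot")
        conn.execute(
            """
            INSERT INTO dashboard_snapshot
                (customer, total_sale, total_paid, balance, taken_at)
            SELECT customer, SUM(sale_amount), SUM(paid_amount),
                   SUM(sale_amount) - SUM(paid_amount), datetime('now')
              FROM credit_entry
             GROUP BY customer
            """
        )
```

Rebuilding the whole snapshot is the simplest variant; because it is always derived from the detail rows, a back-dated correction is picked up by the next refresh.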
Usually I get my Snowflake invoices through email, but I'd like to track my consumption within Snowflake.
I've found a way to find my usage data from the console, but not mapped to actual consumption in dollars.
Any ideas?
Check the new org tables REMAINING_BALANCE_DAILY and USAGE_IN_CURRENCY_DAILY:
https://docs.snowflake.com/en/sql-reference/organization-usage/usage_in_currency_daily.html
https://docs.snowflake.com/en/sql-reference/organization-usage/remaining_balance_daily.html
Some notes:
The contract items view should show the consumption-related products invoiced for, and the usage_in_currency view shows all the information in the monthly usage statement.
The daily usage numbers in org_usage may not be finalized. These numbers can be refreshed for the past several days, especially storage usage.
Once a month closes the data should never change and should tie exactly to the usage statements.
Also check the views RATE_SHEET_DAILY and CONTRACT_ITEMS: https://docs.snowflake.com/en/sql-reference/organization-usage.html#organization-usage-views
I have a question about how I can model my stock database in order to get the best performance possible.
In SQL Server or in Oracle, each update that is executed takes a short-lived lock.
I'd like to know which solution you would recommend.
Solution 1: create a product stock table with a quantity column, and for each input or output execute a SQL update against this column.
Solution 2: create a product stock movement table where, for each input, I insert a row with a positive quantity and, for each output, I insert a row with a negative quantity.
At the end of the day, I would run a process that updates the quantity of each stock product with the sum from the product stock movement table.
After that, I would delete all records in the product stock movement table.
With solution 1, I have the advantage of a simple select to get the product stock quantity, but the disadvantage of many locks during the day, caused by the many quantity updates for sold products.
With solution 2, the disadvantage is that to get the product stock quantity I need a query that joins the product stock movement table and sums all inputs and outputs of the product in question; on the other hand, I wouldn't have any locks during the day.
What do you think about these two solutions?
Is the modeling described in solution 2 good practice?
Thank you so much
A lot of assumptions are being made here, with a potential solution to a "hypothetical" problem. You don't have the numbers to confidently state that either of these designs will lead to a problem. We don't know your hardware specs either.
Do some due diligence first: figure out how much volume you are dealing with on a daily or monthly basis, and how many reads and writes will happen at any given time (per minute or per hour). Once you have these numbers (even if they are not accurate, they should give you some sense of the activity), run some benchmarks for both of your solutions on the actual instance that hosts the database (or a comparable one) and see for yourself which performs better.
Repeat the exercise with 3x or 5x more reads and writes and compare again, so you are covered for future growth.
Decisions made on a bunch of assumptions lead to very opinionated designs, which tend to result in poor choices. Always use data to drive your decisions and validate your assumptions.
PS: talking from experience here. Given that we deal with a very large number of transactions, we generally have a summary table and a detail table, and use triggers to update the counts in the summary table when new records are inserted into the detail table.
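As a rough illustration of the summary-plus-detail pattern with a trigger, here is a sketch using SQLite through Python's sqlite3 module. The names (product_stock, stock_movement, record_movement, stock_quantity) are invented for the example, and SQL Server or Oracle trigger syntax and locking behaviour will differ, so this shows the shape of the design rather than a benchmark candidate:

```python
import sqlite3


def create_stock_schema(conn):
    """Detail table of movements plus a summary table kept current by a trigger."""
    conn.executescript(
        """
        CREATE TABLE product_stock (
            product_id INTEGER PRIMARY KEY,
            quantity   INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE stock_movement (
            id         INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL REFERENCES product_stock (product_id),
            quantity   INTEGER NOT NULL  -- positive = input, negative = output
        );
        -- Keep the summary row in step with every inserted movement.
        CREATE TRIGGER stock_movement_after_insert
        AFTER INSERT ON stock_movement
        BEGIN
            UPDATE product_stock
               SET quantity = quantity + NEW.quantity
             WHERE product_id = NEW.product_id;
        END;
        """
    )


def record_movement(conn, product_id, quantity):
    """Insert one movement; the trigger updates the summary in the same transaction."""
    with conn:
        conn.execute(
            "INSERT INTO stock_movement (product_id, quantity) VALUES (?, ?)",
            (product_id, quantity),
        )


def stock_quantity(conn, product_id):
    """Read the current quantity from the summary table, without summing movements."""
    row = conn.execute(
        "SELECT quantity FROM product_stock WHERE product_id = ?", (product_id,)
    ).fetchone()
    return row[0]
```

The movement rows stay available as history, so the summary can always be re-derived from them if the two ever disagree.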
Good luck.
I am trying to come up with a theoretical solution to an NxN problem for data aggregation and storage. As an example I have a huge amount of data that comes in via a stream. The stream sends the data in points. Each point has 5 dimensions:
Location
Date
Time
Name
Statistics
This data then needs to be aggregated and stored to allow another user to come along and query the data for both location and time. The user should be able to query like the following (pseudo-code):
Show me aggregated statistics for Location 1,2,3,4,....N between Dates 01/01/2011 and 01/03/2011 between times 11am and 4pm
Unfortunately due to the scale of the data it is not possible to aggregate all this data from the points on the fly and so aggregation prior to this needs to be done. As you can see though there are multiple dimensions that the data could be aggregated on.
They can query for any number of days or locations and so finding all the combinations would require huge pre-aggregation:
Record for Locations 1 Today
Record for Locations 1,2 Today
Record for Locations 1,3 Today
Record for Locations 1,2,3 Today
etc... up to N
Preprocessing all of these combinations prior to querying could result in an amount of processing that is not viable. If we have 200 different locations then we have 2^200 combinations, which would be nearly impossible to precompute in any reasonable amount of time.
I did think about creating records on 1 dimension and then merging could be done on the fly when requested, but this would also take time at scale.
Questions:
How should I go about choosing the right dimension and/or combination of dimensions given that the user is as likely to query on all dimensions?
Are there any case studies I could refer to, books I could read or anything else you can think of that would help?
Thank you for your time.
EDIT 1
When I say aggregating the data together I mean combining the statistics and name (dimensions 4 & 5) for the other dimensions. So for example if I request data for Locations 1,2,3,4..N then I must merge the statistics and counts of name together for those N Locations before serving it up to the user.
Similarly, if I request the data for dates 01/01/2015 - 01/12/2015 then I must aggregate all data between those dates (by summing name/statistics).
Finally, if I ask for data between dates 01/01/2015 - 01/12/2015 for Locations 1,2,3,4..N then I must aggregate all data between those dates for all those locations.
For the sake of this example, let's say that going through statistics requires some sort of nested loop and does not scale well, especially on the fly.
Try a time-series database!
From your description it seems that your data is a time-series dataset.
The user seems to be mostly concerned about the time when querying and after selecting a time frame, the user will refine the results by additional conditions.
With this in mind, I suggest you try a time-series database like InfluxDB or OpenTSDB.
For example, Influx provides a query language that is capable of handling queries like the following, which comes quite close to what you are trying to achieve:
SELECT count(location) FROM events
WHERE time > '2013-08-12 22:32:01.232' AND time < '2013-08-13'
GROUP BY time(10m);
I am not sure what you mean by scale, but the time-series DBs have been designed to be fast for lots of data points.
I'd definitely suggest giving them a try before rolling your own solution!
Denormalization is a means of addressing performance or scalability in a relational database.
IMO having some new tables to hold aggregated data and using them for reporting will help you.
I have a huge amount of data that comes in via a stream. The stream
sends the data in points.
There are multiple ways to achieve denormalization in this case:
Adding a new parallel endpoint for data aggregation at the streaming level
Scheduling a job to aggregate data at the DBMS level
Using the DBMS trigger mechanism (less efficient)
In an ideal scenario, when a message reaches the streaming level, two copies of the data message (containing the location, date, time, name and statistics dimensions) are dispatched for processing: one goes to the OLTP side (the current application logic) and the second goes to an OLAP (BI) process.
The BI process will create denormalized aggregated structures for reporting.
I suggest having one aggregated data record per location and date group.
The end user will then query preprocessed data that doesn't need heavy recalculation, at the cost of some acceptable inaccuracy.
How should I go about choosing the right dimension and/or combination
of dimensions given that the user is as likely to query on all
dimensions?
That will depend on your application logic. If possible, limit the user to predefined queries whose values can be set by the user (such as dates from 01/01/2015 to 01/12/2015). In more complex systems, using a report generator on top of the BI warehouse is an option.
I'd recommend Kimball's The Data Warehouse ETL Toolkit.
You can at least reduce Date and Time to a single dimension, and pre-aggregate your data based on your minimum granularity, e.g. 1-second or 1-minute resolution. It could be useful to cache and chunk your incoming stream for the same resolution, e.g. append totals to the datastore every second instead of updating for every point.
What's the size and likelihood of change of the name and location domains? Is there any relation between them? You said that there could be as many as 200 locations. I'm thinking that if name is a very small set and unlikely to change, you could hold counts of names in per-name columns in a single record, reducing the scale of the table to 1 row per location per unit of time.
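To make the idea concrete, here is a small in-memory sketch. It assumes the statistics are additive counts per name, which the question does not actually confirm, and the MinuteAggregates class with its add_point and query methods is purely illustrative. Points are folded into one bucket per location per minute, and a query for any set of locations merges the matching buckets on demand, so no combination of locations has to be precomputed:

```python
from collections import Counter, defaultdict
from datetime import datetime


class MinuteAggregates:
    """One Counter of name counts per (location, minute) bucket."""

    def __init__(self):
        self._buckets = defaultdict(Counter)

    def add_point(self, location, timestamp, name, count=1):
        # Date and time collapse into a single dimension, truncated to the minute.
        minute = timestamp.replace(second=0, microsecond=0)
        self._buckets[(location, minute)][name] += count

    def query(self, locations, start, end):
        """Merge the buckets for the given locations with start <= minute < end."""
        wanted = set(locations)
        total = Counter()
        for (location, minute), counts in self._buckets.items():
            if location in wanted and start <= minute < end:
                total.update(counts)
        return total


agg = MinuteAggregates()
agg.add_point(1, datetime(2011, 1, 1, 11, 0, 5), "donut")
agg.add_point(1, datetime(2011, 1, 1, 11, 0, 40), "donut")
agg.add_point(2, datetime(2011, 1, 1, 11, 5, 0), "bagel")
result = agg.query([1, 2], datetime(2011, 1, 1, 11, 0), datetime(2011, 1, 1, 16, 0))
```

The linear scan in query is only for brevity; a real store would index the buckets by time and location. A time-of-day filter across several dates would simply add another condition on the minute.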
You have a lot of data. Every method will take a lot of time because of the amount of data you're trying to parse.
I have two methods to suggest.
The first one is the brute-force one, which you have probably already thought of:
id | location | date | time | name | statistics
0 | blablabl | blab | blbl | blab | blablablab
1 | blablabl | blab | blbl | blab | blablablab
etc.
With this one you can easily parse and get elements, since they are all in the same table, but the parsing takes long and the table is enormous.
Second one is better I think:
Multiple tables:
id | location
0 | blablabl
id | date
0 | blab
id | time
0 | blab
id | name
0 | blab
id | statistics
0 | blablablab
With this you could parse (a lot) faster, by getting the IDs and then fetching all the information you need.
It also allows you to pre-parse all the data:
You can have the locations sorted by location, the times sorted by time, the names sorted alphabetically, etc., because we don't care how the IDs are mixed:
Whether the IDs are 1 2 3 or 1 3 2, no one actually cares, and parsing goes a lot faster if your data is already parsed into its respective tables.
So, if you use the second method I gave: at the moment you receive a data point, give an ID to each of its columns:
You receive:
London 12/12/12 02:23:32 donut verygoodstatsblablabla
You add the ID to each part of it and put each part in its respective table:
42 | London ==> goes with London location in the location table
42 | 12/12/12 ==> goes with 12/12/12 dates in the date table
42 | ...
With this, if you want all the London data, it is all side by side: you just take all the IDs and fetch the other data with them. If you want all the data between 11/11/11 and 12/12/12, it is all side by side too: you just take the IDs, etc.
Hope I helped, sorry for my poor english.
You should check out Apache Flume and Hadoop
http://hortonworks.com/hadoop/flume/#tutorials
The Flume agent can be used to capture and aggregate the data into HDFS, and you can scale this as needed. Once it is in HDFS there are many options to visualize the data, and you can even use MapReduce or Elasticsearch to view the data sets you are looking for in the examples provided.
I have worked with a point-of-sale database with a hundred thousand products and ten thousand stores (typically week-level aggregated sales, but also receipt-level data for basket analysis, cross sales etc.). I suggest you have a look at these:
Amazon Redshift, highly scalable and relatively simple to get started, cost-efficient
Microsoft Columnstore Indexes, compresses data and has familiar SQL interface, quite expensive (1 year reserved instance r3.2xlarge at AWS is about 37.000 USD), no experience on how it scales within a cluster
ElasticSearch is my personal favourite, highly scalable, very efficient searches via inverted indexes, nice aggregation framework, no license fees, has its own query language but simple queries are simple to express
In my experiments ElasticSearch was faster than Microsoft's column store or clustered index tables for small and medium-size queries by 20 - 50% on same hardware. To have fast response times you must have sufficient amount of RAM to have necessary data structures loaded in-memory.
I know I'm missing many other DB engines and platforms but I am most familiar with these. I have also used Apache Spark but not in data aggregation context but for distributed mathematical model training.
Is there really likely to be a way of doing this without brute forcing it in some way?
I'm only familiar with relational databases, and I think that the only real way to tackle this is with a flat table as suggested before i.e. all your datapoints as fields in a single table. I guess that you just have to decide how to do this, and how to optimize it.
Unless you have to maintain 100% accuracy down to the single record, I think the question really needs to be: what can we throw away?
I think my approach would be to:
Work out what the smallest time fragment would be and quantise the time domain on that. e.g. each analyseable record is 15 minutes long.
Collect raw records together into a raw table as they come in, but as the quantising window passes, summarize the rows into the analytical table (for the 15 minute window).
Deletion of old raw records can be done by a less time-sensitive routine.
Location looks like a restricted set, so use a table to convert these to integers.
Index all the columns in the summary table.
Run queries.
Obviously I'm betting that quantising the time domain in this way is acceptable. You could supply interactive drill-down by querying back onto the raw data by time domain too, but that would still be slow.
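The raw-table-to-summary roll-up described in the steps above might look roughly like this with SQLite through Python's sqlite3 module. The table names, the roll_up helper and the use of integer Unix timestamps are assumptions for the sketch. It also assumes no late points arrive for a window that has already been summarised, and for brevity it deletes the raw rows straight away, where the steps above leave that to a separate, less time-sensitive routine:

```python
import sqlite3

WINDOW_SECONDS = 15 * 60  # size of one quantised time window


def create_rollup_schema(conn):
    conn.executescript(
        """
        CREATE TABLE raw_point (
            ts          INTEGER NOT NULL,  -- Unix timestamp in seconds
            location_id INTEGER NOT NULL,
            value       REAL NOT NULL
        );
        CREATE TABLE summary (
            window_start INTEGER NOT NULL,
            location_id  INTEGER NOT NULL,
            point_count  INTEGER NOT NULL,
            value_sum    REAL NOT NULL,
            PRIMARY KEY (window_start, location_id)
        );
        """
    )


def roll_up(conn, now_ts):
    """Summarise every window that has fully passed, then drop its raw rows."""
    cutoff = now_ts - now_ts % WINDOW_SECONDS  # start of the still-open window
    params = {"w": WINDOW_SECONDS, "cutoff": cutoff}
    with conn:
        conn.execute(
            """
            INSERT INTO summary (window_start, location_id, point_count, value_sum)
            SELECT ts - ts % :w AS window_start, location_id, COUNT(*), SUM(value)
              FROM raw_point
             WHERE ts < :cutoff
             GROUP BY window_start, location_id
            """,
            params,
        )
        conn.execute("DELETE FROM raw_point WHERE ts < :cutoff", params)
```

Queries then run against summary only, and points in the still-open window stay in raw_point until the next roll-up.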
Hope this helps.
Mark
I am writing a rather large application that allows people to send text messages and emails. I will charge 7c per SMS and 2c per email sent. I will allow people to "recharge" their account. So, the end result is likely to be a database table with a few small entries like +100 and many, many entries like -0.02 and -0.07.
I need to check a person's balance immediately when they are trying to send an email or a message.
The obvious answer is to have cached "total" somewhere, and update it whenever something is added or taken out. However, as always in programming, there is more to it: what about monthly statements, where the balance needs to be carried forward from the previous month? My "intuitive" solution is to have two levels of cache: one for the current month, and one entry for each month (or billing period) with three entries:
The total added
The total taken out
The balance to that point
Are there better, established ways to deal with this problem?
Largely depends on the RDBMS.
If it were SQL Server, one solution is to create an Indexed view (or views) to automatically incrementally calculate and hold the aggregated values.
Another solution is to use triggers to aggregate whenever a row is inserted at the finest granularity of detail.
How do I design the database to calculate the account balance?
1) Currently I calculate the account balance from the transaction table
In my transaction table I have "description" and "amount" etc..
I would then add up all "amount" values and that would work out the user's account balance.
I showed this to my friend and he said that is not a good solution: when my database grows it's going to slow down. He said I should create a separate table to store the calculated account balance. If I did this, I would have to maintain two tables, and that's risky: the account balance table could go out of sync.
Any suggestions?
EDIT: OPTION 2: should I add an extra "Balance" column to my transaction table?
Then I would not need to go through many rows of data to perform my calculation.
Example
John buys $100 of credit, spends $60, and then adds $200 of credit.
Amount $100, Balance $100.
Amount -$60, Balance $40.
Amount $200, Balance $240.
An age-old problem that has never been elegantly resolved.
All the banking packages I've worked with store the balance with the account entity. Calculating it on the fly from movement history is unthinkable.
The right way is:
The movement table has an 'opening balance' transaction for each and every account. You'll need this in a few years' time when you need to move old movements out of the active movement table to a history table.
The account entity has a balance field.
There is a trigger on the movement table which updates the account balances for the credited and debited accounts. Obviously, it has commitment control. If you can't have a trigger, then there needs to be a unique module which writes movements under commitment control.
You have a 'safety net' program you can run offline, which re-calculates all the balances and displays (and optionally corrects) erroneous balances. This is very useful for testing.
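The 'safety net' check can be sketched in a few lines of Python. This assumes the stored balances and the movements have already been loaded into memory as a dict and a list of (account_id, signed amount) pairs; find_balance_errors is a hypothetical helper, not part of any banking package:

```python
from decimal import Decimal


def find_balance_errors(stored_balances, movements):
    """Return {account_id: (stored, recomputed)} for every account that disagrees.

    stored_balances: dict of account_id -> Decimal balance held on the account.
    movements: iterable of (account_id, signed Decimal amount), including the
    'opening balance' movement of each account.
    """
    recomputed = {}
    for account_id, amount in movements:
        recomputed[account_id] = recomputed.get(account_id, Decimal("0")) + amount

    errors = {}
    for account_id in stored_balances.keys() | recomputed.keys():
        stored = stored_balances.get(account_id, Decimal("0"))
        expected = recomputed.get(account_id, Decimal("0"))
        if stored != expected:
            errors[account_id] = (stored, expected)
    return errors
```

An empty result means every stored balance matches its movement history; correcting a reported account would mean overwriting its stored balance with the recomputed value.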
Some systems store all movements as positive numbers, and express the credit/debit by inverting the from/to fields or with a flag. Personally, I prefer a credit field, a debit field and a signed amount, this makes reversals much easier to follow.
Note that these methods apply to both cash and securities.
Securities transactions can be much trickier, especially for corporate actions: you will need to accommodate a single transaction that updates one or more buyer and seller cash balances, their security position balances and possibly the broker/depository.
You should store the current account balance and keep it up to date at all times. The transaction table is just a record of what has happened in the past and shouldn't be used at a high frequency just to fetch the current balance. Consider that many queries don't just want balances, they want to filter, sort and group by them, etc. The performance penalty of summing every transaction you've ever created in the middle of complex queries would cripple even a database of modest size.
All updates to this pair of tables should be in a transaction and should ensure that either everything remains in sync (and the account never overdraws past its limit) or the transaction rolls back. As an extra measure, you could run audit queries that check this periodically.
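A sketch of such an all-or-nothing update, using SQLite through Python's sqlite3 module. The schema, the post_transaction helper and the InsufficientFunds exception are made up for the example, amounts are kept as integer cents, and the overdraft limit is assumed to be zero and enforced by a CHECK constraint:

```python
import sqlite3


class InsufficientFunds(Exception):
    """Raised when a transaction would take the balance below zero."""


def create_account_schema(conn):
    conn.executescript(
        """
        CREATE TABLE account (
            id            INTEGER PRIMARY KEY,
            balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0)
        );
        CREATE TABLE account_transaction (
            id           INTEGER PRIMARY KEY,
            account_id   INTEGER NOT NULL REFERENCES account (id),
            amount_cents INTEGER NOT NULL,
            description  TEXT
        );
        """
    )


def post_transaction(conn, account_id, amount_cents, description):
    """Write the history row and update the stored balance together, or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO account_transaction (account_id, amount_cents, description)"
                " VALUES (?, ?, ?)",
                (account_id, amount_cents, description),
            )
            conn.execute(
                "UPDATE account SET balance_cents = balance_cents + ? WHERE id = ?",
                (amount_cents, account_id),
            )
    except sqlite3.IntegrityError as exc:
        # The CHECK constraint rejected the new balance; the insert was rolled back too.
        raise InsufficientFunds(f"account {account_id} would be overdrawn") from exc
```

Because the balance rule lives in the schema, any code path that updates the account is covered by the same check.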
This is a database design I came up with that uses only one table, just for storing a history of operations/transactions. It currently works like a charm on many small projects.
This doesn't replace a specific design. This is a generic solution that could fit most of the apps.
id:int
standard row id
operation_type:int
operation type. pay, collect, interest, etc
source_type:int
from where the operation proceeds.
target table or category: user, bank, provider, etc
source_id:int
id of the source in the database
target_type:int
to what the operation is applied.
target table or category: user, bank, provider, etc
target_id:int
id of the target in the database
amount:decimal(19,2 signed)
price value, positive or negative, to be summed
account_balance:decimal(19,2 signed)
resulting balance
extra_value_a:decimal(19,2 signed) [this was the most versatile option without using string storage]
you can store an additional number: interest percentage, a discount, a reduction, etc.
created_at:timestamp
For the source_type and target_type it would be better to use an enum or separate tables.
If you want a particular balance you can just query the last operation, sorted by created_at descending with a limit of 1. You can query by source, target, operation_type, etc.
For better performance it's recommended to store the current balance in the required target object.
Of course you need to store your current balance with each row, otherwise it is too slow. To simplify development, you can use constraints, so that you don't need triggers and periodic checks of data integrity. I described it here: "Denormalizing to enforce business rules: Running Totals"
A common solution to this problem is to maintain a (say) monthly opening balance in a snapshot schema. Calculating the current balance can be done by adding transactional data for the month to the monthly opening balance. This approach is often taken in accounts packages, particularly where you might have currency conversion and revaluations.
If you have problems with data volume you can archive off the older balances.
Also, the balances can be useful for reporting if you don't have a dedicated external data warehouse or a management reporting facility on the system.
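For a single account, the opening-balance snapshot could be sketched like this in plain Python. The function names and the in-memory dict and list are illustrative only; in practice both would be tables, and the sketch assumes no transactions are dated in the future:

```python
from datetime import date
from decimal import Decimal


def month_total(transactions, month_start):
    """Sum the signed amounts of all transactions dated in the given month."""
    return sum(
        (
            amount
            for tx_date, amount in transactions
            if (tx_date.year, tx_date.month) == (month_start.year, month_start.month)
        ),
        Decimal("0"),
    )


def close_month(opening_balances, transactions, month_start):
    """Carry the balance forward: store the opening balance of the next month."""
    next_month = date(
        month_start.year + month_start.month // 12, month_start.month % 12 + 1, 1
    )
    opening_balances[next_month] = opening_balances[month_start] + month_total(
        transactions, month_start
    )
    return next_month


def current_balance(opening_balances, transactions, today):
    """Opening balance of the current month plus this month's transactions only."""
    month_start = today.replace(day=1)
    return opening_balances[month_start] + month_total(transactions, month_start)
```

Only the current month's rows are ever summed for a balance lookup, and closed months can be archived without losing the carried-forward balance.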
Your friend is wrong and you are right, and I would advise you don't change things now.
If your db ever goes slow because of this, and after you have verified all the rest (proper indexing), some denormalisation may be of use.
You could then put a BalanceAtStartOfYear field in the Accounts table, and summarize only this year records (or any similar approach).
But I would certainly not recommend this approach upfront.
Here I would like to suggest a very simple way to store your opening balance:
Create a trigger function on the transaction table, to be called only after an update or insert.
Create a column named Opening Balance in the account master table.
Save your opening balances as an array in the Opening Balance column of the master table.
You don't even need a server-side language to store this array; you can simply use database array functions such as those available in PostgreSQL.
When you want to recalculate the opening balances in the array, just group your transaction table with an array function and update the whole data in the master table.
I have done this in PostgreSQL and it works fine.
Over time, when your transaction table becomes heavy, you can partition it by date to speed up performance.
This approach is very easy and needs no extra table, which could slow performance when joining, because fewer tables in a join gives you better performance.
My approach is to store the debits in a debit column and the credits in a credit column, and when fetching the data, create two arrays, a debit array and a credit array. Then keep appending the selected data to the arrays, like this in Python:
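def real_insert(arr, index, value):
    """Overwrite arr[index], or insert at that position if it does not exist yet."""
    try:
        arr[index] = value
    except IndexError:
        arr.insert(index, value)


def add_array(args=(), index=0):
    """Sum the first `index` items of args, or all of them when index is 0."""
    values = args[:index] if index else args
    total = 0
    for a in values:
        total += a
    return total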
then
for n in range(len(array)):
    self.store.clear()
    self.store.append([str(array[n][4])])
    real_insert(self.row_id, n, array[n][0])
    real_insert(self.debit_array, n, array[n][7])
    real_insert(self.credit_array, n, array[n][8])
    # Assets and Expenses: debits minus credits; every other category: the reverse.
    if self.category in ["Assets", "Expenses"]:
        balance = add_array(self.debit_array) - add_array(self.credit_array)
    else:
        balance = add_array(self.credit_array) - add_array(self.debit_array)
Simple answer: Do all three.
Store the current balance; and in each transaction store the movement and a snapshot of the current balance at that point in time. This would give something extra to reconcile in any audit.
I've never worked on core banking systems, but I have worked on investment management systems, and in my experience this is how it's done.
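A toy Python sketch of "all three", using the $100 / -$60 / $200 example from the question. The Account and Entry classes are hypothetical, and a real system would persist both and update them inside one database transaction:

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Entry:
    amount: Decimal         # the movement itself, signed
    balance_after: Decimal  # snapshot of the balance right after this movement


@dataclass
class Account:
    balance: Decimal = Decimal("0")  # the stored current balance
    entries: list = field(default_factory=list)

    def post(self, amount):
        """Update the current balance and record the movement with its snapshot."""
        self.balance += amount
        self.entries.append(Entry(amount, self.balance))

    def audit(self):
        """Check that movements, per-entry snapshots and stored balance all agree."""
        running = Decimal("0")
        for entry in self.entries:
            running += entry.amount
            if entry.balance_after != running:
                return False
        return running == self.balance


john = Account()
for amount in (Decimal("100"), Decimal("-60"), Decimal("200")):
    john.post(amount)
snapshots = [entry.balance_after for entry in john.entries]
```

The audit has three things to reconcile against each other, so a corrupted snapshot or a drifted stored balance shows up as a mismatch.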