Adding new aggregations to a time series database - database

I'm implementing a database system in postgresql to support fast queries about time series data from users. Events are for example: User U executed action A at time T. Different event types are split into different tables, currently around 20. As the number of events currently are around 20M and will reach 1B pretty soon, I decided to create aggregation tables. The aggregations are for example: How many users executed at least one action at a particular day, or total number of actions executed each day.
I have created insert triggers that inserts data into the aggregation tables whenever a row is inserted into the event tables. This works great and offers great performance with the current amount of events, and I think it should scale good to.
However, if I want to create a new aggregation only events from that point forward would be aggregated. To have all the old events included, they would have to be re-inserted. I see two ways this could be achieved. The first is to create a "re-run" function that essentially does the following:
Find all the tables this aggregation depends on, and all tables those aggregation depends on etc. until you have all direct and indirect dependencies.
Copy the tables to temporary tables
Empty the tables and the aggregation tables.
Re-insert data from the temporary tables.
This poses some questions about atomicity. What if an event is inserted after copying? Should one lock all the tables involved during this operation?
The other solution would be to keep track, for each aggregation table, which rows in the event tables that has been aggregated, and then at some point aggregate all the event that is missing from that track table. This seems to me less prone to concurrency errors, but requires a lot of tracking storage.
Are there any other solutions, and if not, which of the above would you choose?

Related

Database / Application Design - When to Denormalise?

I am working on an application which needs to store financial transactions for an account.
This data will then need to be queried in a number of ways. I'll need to list individual transactions, monthly totals by category, for example. I'll also need to show a monthly summary with opening / closing balances.
As I see it, I could approach this in the following ways:
From the point of view of database consistency and normalisation, this could be modelled as a simple list of transactions. Balances may then be calculated in the application by adding up every transaction from the beginning of time to the date of the balance you wish to display.
A slight variation on this would be to model the data in the same way, but calculate the balances in a stored procedure on the database server. Appreciate that this isn't hugely different to #1 - both of these issues will perform slower as more data gets added to the system.
End of month balances could be calculated and stored in a separate table (possibly updated by triggers). I don't really like this approach from a data consistency point of view, but it should scale better.
I can't really decide which way to go with this. Should I start with the 'purest' data model and only worry about performance when it becomes an issue? Should I assume performance will become a problem and plan for it from day one? Is there another option which I haven't thought of which would solve the issue better?
I would look at it like this, the calculations are going to take longer and longer and that majority of the monthly numbers before the previous 2-3 months will not be changing. This is a performance problem that has a 100% chance of happening as financial date will grow every month. Therefore looking at a solution in the design phase is NOT premature optimization, it is smart design.
I personally am in favor of only calculating such totals when they need to be calculated rather than every time they are queried. Yes the totals should be updated based on triggers on the table which will add a slight overhead to inserts and deletes. They will make the queries for selects much faster. In my experience users tend to be more tolerant of a slightly longer action query than a much longer select query. Overall this is a better design for this kind of data than a purely normalized model as long as you do the triggers correctly. In the long run, only calculating numbers that have changed will take up far less server resources.
This model will maintain data integrity as long as all transactions go through the trigger. The biggest culprit in that is usually data imports which often bypass triggers. If you do those kinds of imports make sure they have code that mimics the trigger code. Also make sure the triggers are for insert/update and delete and that they are tested using multiple record transactions not just against single records.
The other model is to create a data warehouse that populates on a schedule such as nightly. This is fine if the data can be just slightly out of date. If the majority of queries of this consolidated data will be for reporting and will not involve the current month/day so much then this will work well and you can do it in an SSIS package.

Fact table partitioning: how to handle updates in ETL?

We are trying to implement table partitioning for a Data Warehouse Fact table which contains approximately 400M rows. Our ETL takes data from source system 50 days backwards (new rows, modified rows, based on source system timestamp) from the previous load. So in every ETL cycle there are new rows coming in, and also old rows which are updating the corresponding rows in the Fact table. The idea is to insert new rows into the Fact table and update modified rows.
The partition column would be date (int, YYYYMMDD) and we are considering to partition by month.
As far as I'm concerned, table partitioning would ease our inserts via fast partition switch operations. We could split the most recent partition to create a new free partition, load new rows into a staging table (using date constraint, e.g for the most recent month) and then use partition switch operation to "move" new rows into the partitioned Fact table. But how can we handle the modified rows which should update the corresponding rows in the Fact table? Those rows can contain data from the previous month(s). Does partition switch help here? Usually INSERT and UPDATE rows are determined by an ETL tool (e.g. SSIS in our case) or by MERGE statement. How partitioning works in these kind of situations?
I'd take another look at the design and try to figure out if there's a way around the updates. Here are a few implications of updating the fact table:
Performance: Updates are fully logged transactions. Big fact tables also have lots of data to read and write.
Cubes: Updating the fact table requires reprocessing the affected partitions. As your fact table continues to grow, the cube processing time will continue to as well.
Budget: Fast storage is expensive. Updating big fact tables will require lots of fast reads and writes.
Purist theory: You should not change the fact table unless the initial value was an error (ie the user entered $15,000 instead of $1,500). Any non-error scenario will be changing the originally recorded transaction.
What is changing? Are the changing pieces really attributes of a dimension? If so, can they be moved to a dimension and have changes handled with a Slowly Changing Dimension type task?
Another possibility, can this be accomplished via offsetting transactions? Example:
The initial InvoiceAmount was $10.00. Accounting later added $1.25 for tax then billed the customer for $11.25. Rather than updating the value to $11.25, insert a record for $1.25. The sum amount for the invoice will still be $11.25 and you can do a minimally logged insert rather than a fully logged update to accomplish.
Not only is updating the fact table a bad idea in theory, it gets very expensive and non-scalable as the fact table grows. You'll be reading and writing more data, requiring more IOPS from the storage subsytem. When you get ready to do analytics, cube processing will then throw in more problems.
You'll also have to constantly justify to management why you need so many IOPS for the data warehouse. Is there business value/justification in needing all of those IOPS for your constant changing "fact" table?
If you can't find a way around updates on the fact table, at least establish a cut-off point where the data is determined read-only. Otherwise, you'll never be able to scale.
Switching does not help here.
Maybe you can execute updates concurrently using multiple threads on distinct ranges of rows. That might speed it up. Be careful not to trigger lock escalation so you get good concurrency.
Also make sure that you update the rows mostly in ascending sort order of the clustered index. This helps with disk IO (this technique might not work well with multi-threading).
There are as many reasons to update a fact record as there are non-identifying attributes in the fact. Unless you plan on a "first delete" then "insert", you simply cannot avoid updates. You cannot simply say "record the metric deltas as new facts".

Using a denormalized database table for analytic data

In an online ticketing system I've built, I need to add real-time analytical reporting on orders for my client.
Important order data is split over multiple tables (customers, orders, line_items, package_types, tickets). Each table contains additional data that is unimportant to any report my client may need.
I'm considering recording each order as a separate line item in a denormalized report table. I'm trying to figure out if this makes sense or not.
Generally, the queries I'm running for the report only have to join across two or three of the tables at a time. Each table has the appropriate indices added.
Does it make sense to compile all of the order data into one table that contains only the necessary columns for the reporting?
The application is built on Ruby on Rails 3 and the DB is Postgresql.
EDIT: The goal of this would be to render the data in the browser as fast as possible for the user.
depends on what your goal is. if you want to make the report outputs faster to display then that would certainly work. the trade off is that the data is somewhat maintained through batch updates. You could write a trigger that updates the table anytime a new record comes in to the base tables, but that could potentially add a lot of overhead.
Maybe a view instead of a new table is a better solution in this case?

How to auto remove an expired record from a database?

We are building a large stock and forex trading platform using a relational database. At any point during the day there will be thousands, if not millions, of records in our Orders table. Some orders, if not fulfilled immediately, expire and must be removed from this table, otherwise, the table grows very quickly. Each order has an expiration time. Once an order expires it must be deleted. Attempting to do this manually using a scheduled job that scans and deletes records is very slow and hinders the performance of the system. We need to force the record to basically delete itself.
Is there way to configure any RDBMS database to automatically remove a record based on a date/time field if the time occurs in the past?
Since you most likely will have to implement complex order handling, e.g. limit orders, stop-limit orders etc. you need a robust mechanism for monitoring and executing orders in real time. This process is not only limited to expired orders. This is a core mechanism in a trading platform and you will have to design a robust solution that fulfill your needs.
To answer your question: Delete expired orders as part of your normal order handling.
Why must the row be deleted?
I think you are putting the cart before the horse here. If a row is expired, it can be made "invisible" to other parts of the system in many ways, including views which only show orders meeting certain criteria. Having extra deleted rows around should not hamper performance if your database is appropriately indexed.
What level of auditing and tracking is necessary? Is no analysis ever done on expired orders?
Do fulfilled orders become some other kind of document/entity?
There are techniques in many databases which allow you to partition tables. Using the partition function, it is possible to regularly purge partitions (of like rows) much more easily.
You have not specified what DB you are using but lets assume you use MSSQL you could create a agent job that runs periodicly, but you are saying that that might not be a solution for you.
So what t about having an Insert Trigger that when new record is inserted you delete all the record that are expired? This will keep number of record all relatively small.

Do triggers decreases the performance? Inserted and deleted tables?

Suppose i am having stored procedures which performs Insert/update/delete operations on table.
Depending upon some criteria i want to perform some operations.
Should i create trigger or do the operation in stored procedure itself.
Does using the triggers decreases the performance?
Does these two tables viz Inserted and deleted exists(persistent) or are created dynamically?
If they are created dynamically does it have performance issue.
If they are persistent tables then where are they?
Also if they exixts then can i access Inserted and Deleted tables in stored procedures?
Will it be less performant than doing the same thing in a stored proc. Probably not but with all performance questions the only way to really know is to test both approaches with a realistic data set (if you have a 2,000,000 record table don't test with a table with 100 records!)
That said, the choice between a trigger and another method depends entirely on the need for the action in question to happen no matter how the data is updated, deleted, or inserted. If this is a business rule that must always happen no matter what, a trigger is the best place for it or you will eventually have data integrity problems. Data in databases is frequently changed from sources other than the GUI.
When writing a trigger though there are several things you should be aware of. First, the trigger fires once for each batch, so whether you inserted one record or 100,000 records the trigger only fires once. You cannot assume ever that only one record will be affected. Nor can you assume that it will always only be a small record set. This is why it is critical to write all triggers as if you are going to insert, update or delete a million rows. That means set-based logic and no cursors or while loops if at all possible. Do not take a stored proc written to handle one record and call it in a cursor in a trigger.
Also do not send emails from a cursor, you do not want to stop all inserts, updates, or deletes if the email server is down.
Yes, a table with a trigger will not perform as well as it would without it. Logic dictates that doing something is more expensive than doing nothing.
I think your question would be more meaningful if you asked in terms of whether it is more performant than some other approach that you haven't specified.
Ultimately, I'd select the tool that is most appropriate for the job and only worry about performance if there is a problem, not before you have even implemented a solution.
Inserted and deleted tables are available within the trigger, so calling them from stored procedures is a no-go.
It decreases performance on the query by definition: the query is then doing something it otherwise wasn't going to do.
The other way to look at it is this: if you were going to manually be doing whatever the trigger is doing anyway then they increase performance by saving a round trip.
Take it a step further: that advantage disappears if you use a stored procedure and you're running within one server roundtrip anyway.
So it depends on how you look at it.
Performance on what? the trigger will perform an update on the DB after the event so the user of your system won't even know it's going on. It happens in the background.
Your question is phrased in a manner quite difficult to understand.
If your Operation is important and must never be missed, then you have 2 choice
Execute your operation immediately after Update/Delete with durability
Delay the operation by making it loosely coupled with durability.
We also faced the same issue and our production MSSQL 2016 DB > 1TB with >500 tables and need to send changes(insert, update, delete) of few columns from 20 important tables to 3rd party. Number of business process that updates those few columns in 20 important tables were > 200 and it's a tedious task to modify them because it's a legacy application. Our existing process must work without any dependency of data sharing. Data Sharing order must be important. FIFO must be maintained
eg User Mobile No: 123-456-789, it change to 123-456-123 and again change to 123-456-456
order of sending this 123-456-789 --> 123-456-123 --> 123-456-456. Subsequent request can only be send if response of first previous request is successful.
We created 20 new tables with limited columns that we want. We compare main tables and new table (MainTable1 JOIN MainTale_LessCol1) using checksum of all columns and TimeStamp Column to Identify change.
Changes are logged in APIrequest tables and updated back in MainTale_LessCol1. Run this logic in Scheduled Job every 15 min.
Separate process will pick from APIrequest and send data to 3rd party.
We Explored
Triggers
CDC (Change Data Capture)
200+ Process Changes
Since our deadlines were strict, and cumulative changes on those 20 tables were > 1000/sec and our system were already on peak capacity, our current design work.
You can try CDC share your experience

Resources