Using a denormalized database table for analytic data

In an online ticketing system I've built, I need to add real-time analytical reporting on orders for my client.
Important order data is split over multiple tables (customers, orders, line_items, package_types, tickets). Each table contains additional data that is unimportant to any report my client may need.
I'm considering recording each order as a separate line item in a denormalized report table. I'm trying to figure out if this makes sense or not.
Generally, the queries I'm running for the report only have to join across two or three of the tables at a time. Each table has the appropriate indices added.
Does it make sense to compile all of the order data into one table that contains only the necessary columns for the reporting?
The application is built on Ruby on Rails 3 and the DB is PostgreSQL.
EDIT: The goal of this would be to render the data in the browser as fast as possible for the user.

It depends on what your goal is. If you want the report output to display faster, then that would certainly work. The trade-off is that the report table has to be maintained, typically through batch updates, so it can lag behind the base tables. You could write a trigger that updates the table any time a new record comes in to the base tables, but that could potentially add a lot of overhead.
Maybe a view instead of a new table is a better solution in this case?
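To make the two options concrete, here is a minimal sketch in Python. It uses SQLite as a stand-in for PostgreSQL, and every table, column, and trigger name in it is a hypothetical simplification of the schema described in the question. It only handles inserts; updates and deletes on the base tables would need their own triggers.

```python
import sqlite3

# SQLite stands in for PostgreSQL; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, notes TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, placed_at TEXT);
CREATE TABLE line_items (id INTEGER PRIMARY KEY, order_id INTEGER, price_cents INTEGER);

-- Option A: a plain view. Always current, but the joins run on every read.
CREATE VIEW order_report_view AS
SELECT li.id AS line_item_id, o.id AS order_id, c.name AS customer_name,
       o.placed_at, li.price_cents
FROM line_items li
JOIN orders o ON o.id = li.order_id
JOIN customers c ON c.id = o.customer_id;

-- Option B: a denormalized table kept current by a trigger on insert.
CREATE TABLE order_report (
    line_item_id INTEGER PRIMARY KEY,
    order_id INTEGER, customer_name TEXT, placed_at TEXT, price_cents INTEGER
);
CREATE TRIGGER line_items_to_report AFTER INSERT ON line_items
BEGIN
    INSERT INTO order_report
    SELECT NEW.id, o.id, c.name, o.placed_at, NEW.price_cents
    FROM orders o JOIN customers c ON c.id = o.customer_id
    WHERE o.id = NEW.order_id;
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'internal note')")
conn.execute("INSERT INTO orders VALUES (10, 1, '2011-05-01')")
conn.executemany("INSERT INTO line_items VALUES (?, 10, ?)", [(100, 2500), (101, 4000)])

from_view = conn.execute(
    "SELECT * FROM order_report_view ORDER BY line_item_id").fetchall()
from_table = conn.execute(
    "SELECT * FROM order_report ORDER BY line_item_id").fetchall()
```

Both queries return the same rows. The difference is where the join cost is paid: at read time for the view, at write time for the trigger-maintained table.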

Related

Is there any penalty on creating a view from another view?

I have tables that are historicized and then views are created from them to retain only the most recent and active data.
I wanted to make views that would aggregate some of these views together, where I would create my view as a SELECT * FROM {Other view(s)}. So a bit like this:
Table -> Intermediate View -> Aggregated View
I'm just wondering if I'll run into any performance hit by basing my view on other views. Should I just instead have my aggregated views be more complex code-wise, but based directly on the underlying tables?
Table -> Aggregated View
Or does it not make a difference at all?
Thanks a lot.
From a performance viewpoint, it doesn't make any difference, unless you are building your view on a single table, in which case you would be able to materialize it. In fact, one of the biggest limitations of materialized views in Snowflake is that the FROM clause has to refer to a single table.
From a software engineering viewpoint, I see many advantages, such as more reusable work and more flexible and potentially faster development (while developer-A works on View-A, developer-B works on View-B, and developer-C could even work on View-C to combine View-A and View-B).
The downside is the increased complexity of the views' lineage, which might require a graphical representation when there are too many objects.
I have found myself doing this more and more in Snowflake, to the point where I'm writing a blog post about it and giving it a new acronym, ELVT. I've built a three-layer stack of VIEWs at one client. The lowest level is a simple VIEW against a single table, with presentation names for each column. The next layer is business logic on top of that single-table VIEW. The third level joins VIEWs and adds more complex business logic (lots of UDFs).
I have a meta-data repository from which I generate all of the VIEWs (which also provides lineages).
The final VIEWs have 35+ joins against 40+ physical tables. Salesforce, Marketo, Eloqua and others.
SELECT * against multiple years of data using a medium DW averaged 1 min 25 s.
These VIEWs replaced thousands of lines of QLIK scripting with SELECT FROM VIEW.
One point to note: if you are comparing one really large block of SQL to nested views (which behave like macros), they will perform the same.
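As a small illustration of the "views as macros" idea, here is a sketch in Python with SQLite standing in for Snowflake. The table and view names are hypothetical. It only shows that the stacked views and the single flat query are logically the same statement; it says nothing about timings on a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_history (account_id INTEGER, region TEXT, amount REAL,
                              valid_to TEXT, is_active INTEGER);
INSERT INTO account_history VALUES
    (1, 'EU', 10.0, '2019-01-01', 0),
    (1, 'EU', 12.0, NULL, 1),
    (2, 'US', 7.0, NULL, 1),
    (3, 'US', 5.0, NULL, 0);

-- Table -> intermediate view: keep only the most recent, active rows.
CREATE VIEW current_accounts AS
SELECT account_id, region, amount FROM account_history
WHERE valid_to IS NULL AND is_active = 1;

-- Intermediate view -> aggregated view.
CREATE VIEW region_totals AS
SELECT region, SUM(amount) AS total FROM current_accounts GROUP BY region;
""")

# Query through the stacked views.
nested = conn.execute(
    "SELECT region, total FROM region_totals ORDER BY region").fetchall()

# The same logic written directly against the underlying table.
flat = conn.execute("""
    SELECT region, SUM(amount) AS total FROM account_history
    WHERE valid_to IS NULL AND is_active = 1
    GROUP BY region ORDER BY region
""").fetchall()
```

Both `nested` and `flat` hold the same rows, which is what you would expect if the engine simply expands each view definition inline before planning.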
The downside to nested views is that you are selecting a lot of columns (in the SQL that gets compiled), so if at the top level you are not using most of the columns, your SQL compile times will be marginally slower.
Also, if you put a filter (say, a date range) over a large volume of SQL, the optimizer can sometimes fail to push the filter down, and you then pull and compute large amounts of data that are later thrown away.
We found this happened, and the optimizer behavior can change between releases, sometimes for the better and sometimes for much worse.
We ended up using table functions for a number of parts of SQL to force the date range into the lower layer "views". But we also controlled the layer writing the dynamic SQL so this was an easy substitution.
It depends on what type of processing you are doing in the view. If it is a lot, then you can create a materialized view (this requires storage, and hence will incur some cost).
As a first option, try creating a plain view, and if that does not help, then try a materialized view.

How to store order history in database, especially old picture paths?

(table: order_items)
I'm not sure if this is the correct way to implement an order history table in my database. Normally, I try to reduce redundancy. But because the user can change the data in his/her offer, I need to save the minimum information about the order.
Goal: Buyer can see his/her old orders with correct title/pictures/origin path/allergens (long story...)
What speaks against my approach?
The only "fear" is that the table is going to be bloated with a lot of redundant information.
This started out as a comment but it's getting too long, so...
What database are you working with?
SQL Server, for instance, introduced the concept of temporal tables in the 2016 version. Basically you have two tables identical in structure, where one is the main table, on which you can use DML just as you would with a normal table, and the other is a read-only table that stores the historical data. When you update a record in the main table, what actually happens is that the record gets copied into the history table first, and updated afterwards.
Something similar might exist in other databases as well, and can also be implemented manually quite easily using triggers in case your database does not provide it out of the box.
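A minimal sketch of the manual, trigger-based variant, written in Python against SQLite since the question does not name a database. The `offers` and `offers_history` tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offers (id INTEGER PRIMARY KEY, title TEXT, picture_path TEXT);

-- Same columns as the main table, plus the time the old version was archived.
CREATE TABLE offers_history (
    id INTEGER, title TEXT, picture_path TEXT,
    archived_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Copy the old version of the row into the history table before each update.
CREATE TRIGGER offers_keep_history BEFORE UPDATE ON offers
BEGIN
    INSERT INTO offers_history (id, title, picture_path)
    VALUES (OLD.id, OLD.title, OLD.picture_path);
END;
""")

conn.execute("INSERT INTO offers VALUES (1, 'Apple pie', '/img/pie_v1.jpg')")
conn.execute("UPDATE offers SET picture_path = '/img/pie_v2.jpg' WHERE id = 1")

current = conn.execute("SELECT title, picture_path FROM offers").fetchall()
history = conn.execute(
    "SELECT id, title, picture_path FROM offers_history").fetchall()
```

After the update, `offers` holds only the new picture path while `offers_history` still has the old one, so an old order can be displayed from the archived row.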
Of course, you could use the technique called "soft delete", where instead of actually deleting the data you simply mark it as deleted, and instead of updating the data you create a new record with the updated data and change the status of the existing record to Inactive.
The major advantage of this approach over temporal tables is that you still only have one table for your entity instead of two. On the other hand, the advantage of temporal tables is that the active data is kept in a separate table from the historical data; therefore the active data is stored in a relatively small table and, as a result, all CRUD operations are more efficient.
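The single-table variant could look roughly like this. It is a sketch under assumptions: SQLite via Python, a hypothetical `offers` table, and a hypothetical `update_offer` helper that performs the "mark old row inactive, insert new row" step in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE offers (
    row_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- identifies one version
    offer_id INTEGER,                          -- identifies the offer itself
    title TEXT,
    status TEXT DEFAULT 'active')""")

def update_offer(conn, offer_id, new_title):
    """Deactivate the current version and insert the new one, atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE offers SET status = 'inactive' "
                     "WHERE offer_id = ? AND status = 'active'", (offer_id,))
        conn.execute("INSERT INTO offers (offer_id, title) VALUES (?, ?)",
                     (offer_id, new_title))

update_offer(conn, 1, "Apple pie")
update_offer(conn, 1, "Apple pie (vegan)")

rows = conn.execute(
    "SELECT offer_id, title, status FROM offers ORDER BY row_id").fetchall()
```

An order row can then reference `row_id` rather than `offer_id`, which pins it to the exact version the buyer saw.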
The "fear" of having a bloated table in this day and age, when memory and storage are so cheap, seems a bit strange to me.

SQLite performance advice for .net

I am using SQLite in my application. The scenario is that I have stock market data and each company is a database with 1 table. That table stores records, which can range from a couple thousand to half a million.
Currently, when I update the data in real time, I open a connection and check whether that particular data exists. If not, I insert it and close the connection. This is done in a loop, and each database (representing a company) is updated. The number of records inserted is low and is not the problem. But is the process okay?
An alternate way is to have 1 database with many tables (each company can be a table) and each table can have a lot of records. Is this better or not?
You can expect around 500 companies. I am coding in VS 2010. The language is VB.NET.
The optimal organization for your data is to make it properly normalized, i.e., put all data into a single table with a company column.
This is better for performance because the table- and database-related overhead is reduced.
Queries can be sped up with indexes, but what indexes you need depends on the actual queries.
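As a rough sketch of that layout (in Python rather than VB.NET, though the SQL carries over unchanged): one `quotes` table with a `company` column, and a composite primary key so the "insert only if missing" check is done by the database. The table, column, and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quotes (
    company   TEXT NOT NULL,
    quoted_at TEXT NOT NULL,
    price     REAL NOT NULL,
    PRIMARY KEY (company, quoted_at)  -- also serves as an index for lookups by company
);
""")

def store_quotes(conn, rows):
    """Insert only rows that are not already present, in one transaction."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO quotes (company, quoted_at, price) "
            "VALUES (?, ?, ?)", rows)

first = [("ACME", "2012-01-02 09:30", 10.5), ("ACME", "2012-01-02 09:31", 10.6)]
store_quotes(conn, first)
# Re-sending the same rows plus one new company: duplicates are skipped.
store_quotes(conn, first + [("INIT", "2012-01-02 09:30", 3.2)])

total = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
acme = conn.execute(
    "SELECT COUNT(*) FROM quotes WHERE company = ?", ("ACME",)).fetchone()[0]
```

This replaces the open/check/insert/close loop over 500 databases with one connection and one transaction per batch.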
I did something similar, with similar-sized data, in another field. It depends a lot on your indexes. Ultimately, separating each large table was best (1 table per file, representing a cohesive unit, in your case one company). Plus you gain the advantage of each company table having the same name, versus having x tables with different names that share the same schema (and no sanitizing of company names is required to create new tables).
Internally, other DBMSs often keep at least one file per table, so SQL is just a layer of abstraction above that. SQLite (despite its creators' boasting) is meant for small projects, and querying larger data models gets more finicky if you want it to work well.

Keeping history of data revisions - best practice?

Consider a database with several (3-4) tables with a lot of columns (from 15 to 40). In each table we have several thousand records generated per year and about a dozen changes made to each record.
Right now we need to add the following functionality to our system: every time a user makes a change to a record in one of our tables, the system needs to keep track of it. We need to have a complete history of changes and also be able to restore row data to a selected point.
For some reasons we cannot keep "final" and "historic" data in the same table (so we cannot add columns to our tables to keep some kind of versioning information, e.g. like WordPress does when it keeps the edit history of posts).
What would be best approach to this problem? I was thinking about two solutions:
1. For each tracked table we have a mirror table with the same columns, plus additional columns where we keep information about versions (e.g. timestamps, id of the "original" row, etc.).
   Pros:
   - the data is stored in exactly the same way as in the original tables
   - whenever we need to add a new column to the original table, we can do the same to the mirror table
   Cons:
   - we need to create one additional mirror table for each tracked table
2. We create one table for "history" revisions. We keep some revision information such as timestamps, and we also record which table the data originates from. The original data row itself is stored as JSON in a large text column.
   Pros:
   - we have only one history table for all tracked tables
   - we don't need to create a new mirror table every time we add a new tracked table
   Cons:
   - there can be backward compatibility issues when trying to restore data after the structure of the original table has changed (e.g. a new column was added)
Maybe some other solution?
What would be the best way of keeping the history of versions in such system?
Additional information:
- each of the tracked tables can change in the future (e.g. new columns added),
- the number of tracked tables can change in the future (e.g. new tables added).
FYI: we are using Laravel 5.3 and a MySQL database.
How often do you need access to the auditing data? Is cost of storage ever a concern? Do you need it in the same system that you need the normal data?
Basically, having a table called foo and a second table called foo_log isn't uncommon. It also lets you store foo_log somewhere different, possibly even in a secondary DB. If foo_log is on a spinning disk and foo is on flash, you still get fast reads of the live data, but somewhat cheaper storage for the history.
If you don't ever need to display this data, and just need it for legal reasons or to figure out how something went wrong, the single-table approach isn't a terrible plan.
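For reference, the single-table approach from the question could be sketched as below. This uses Python and SQLite in place of Laravel and MySQL, and the `history` table and the `save_revision` / `restore_revision` helpers are hypothetical. The restore step skips any stored column that no longer exists in the live table, which is one possible way to soften the backward-compatibility concern raised in the question.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER);
CREATE TABLE history (
    revision INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT, row_id INTEGER, row_json TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

def save_revision(conn, table, row_id):
    """Snapshot the current row as JSON before it is changed."""
    # `table` must come from a fixed allow-list, never from user input.
    row = conn.execute(f"SELECT * FROM {table} WHERE id = ?", (row_id,)).fetchone()
    conn.execute(
        "INSERT INTO history (table_name, row_id, row_json) VALUES (?, ?, ?)",
        (table, row_id, json.dumps(dict(row))))

def restore_revision(conn, revision):
    """Write a stored snapshot back, skipping columns that no longer exist."""
    rev = conn.execute(
        "SELECT * FROM history WHERE revision = ?", (revision,)).fetchone()
    table = rev["table_name"]
    data = json.loads(rev["row_json"])
    live = {r["name"] for r in conn.execute(f"PRAGMA table_info({table})")}
    cols = [c for c in data if c in live and c != "id"]
    assignments = ", ".join(f"{c} = ?" for c in cols)
    conn.execute(f"UPDATE {table} SET {assignments} WHERE id = ?",
                 [data[c] for c in cols] + [rev["row_id"]])

conn.execute("INSERT INTO foo VALUES (1, 'widget', 5)")
save_revision(conn, "foo", 1)
conn.execute("UPDATE foo SET qty = 9 WHERE id = 1")
conn.execute("ALTER TABLE foo ADD COLUMN color TEXT")  # schema changes after the snapshot
restore_revision(conn, 1)

restored = dict(conn.execute("SELECT * FROM foo WHERE id = 1").fetchone())
```

Columns added after the snapshot (here `color`) are simply left untouched by the restore.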
But if the issue is backups, which it sounds like it might be, why not just back up the MySQL database on a regular basis and store the backups elsewhere?

Adding VBA Array to New Access DB

I'm pretty proficient with VBA, but I know almost nothing about Access! I'm running a complex simulation using arrays in VBA, and I want to store the results somewhere. Since the results of the simulation will be quite large (~1GB in memory), I'd like to store them in Access rather than Excel.
I currently have a large number of Arrays populated with my data, but I'm not sure how to write these to a database, or even how to create one with VBA. Here's what I need to do, in a nutshell, with VBA:
1. Create a new Access database
2. Create a new Access table (the db will be only a single table)
3. Create ~1200 fields programmatically
4. Copy the results from my arrays to the new Access table
I've looked at a number of answers on here, but none of them seem to answer my question fully. For instance, Adding field to MS Access Table using VBA talks about adding fields to a database. But I don't see doubles listed here. Most of my arrays are doubles. Will this be a problem?
EDIT:
Here are a few more details about the project:
I am running a network design simulation. I start by generating ~150,000 unique networks. Then I run a lot of calculations (no, these can't be simplified to queries, unfortunately!) of characteristics for each network. There end up being ~1200 characteristics for each possible network (unique record). I would like to store these in an Access database. Each record will be a unique network, and each field will be a specific characteristic associated with that network.
Virtually all of the fields (arrays at this point!) are doubles.
You (almost?) never want a database with one table. You might as well store it in a text file. One of the main benefits of databases is relating data in different tables, and with one table you don't need it.
Fortunately for you, you need more than one table and a database might be the way to go. You (almost) never need to create permanent tables in code (temp tables, sure, but not permanent ones). If your field names are variable, you need to change your design. When data is variable, it goes in the data part of a database. When it's fixed, it can be a table or a field. Based on what you've said, I think you need this:
In Access, create a table called tblNetworks with these fields
NetworkID AutoNumber
NetworkName Short Text
Then create another table called tblCalculations with these fields
CalcID AutoNumber
NetworkID Long (Relates to tblNetworks, one to many)
CalcDesc Short Text
Result Number (Double)
What you were going to name your fields in your Access table will be the CalcDesc data. You'll use ADODB to execute INSERT INTO SQL statements that put the data in the tables.
You'll end up with tblNetworks holding 150k records and tblCalculations holding 1,200 x 150k records or so. When your tables grow longer and not wider as things change, that's a good indication you designed it right.
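The reshaping from "one array per characteristic" to "one row per network and characteristic" can be sketched as follows. This is Python with SQLite standing in for VBA, ADODB, and Access; the table and field names follow the design above, while the sample arrays and network names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblNetworks (
    NetworkID INTEGER PRIMARY KEY AUTOINCREMENT,
    NetworkName TEXT
);
CREATE TABLE tblCalculations (
    CalcID INTEGER PRIMARY KEY AUTOINCREMENT,
    NetworkID INTEGER REFERENCES tblNetworks(NetworkID),
    CalcDesc TEXT,
    Result REAL
);
""")

# Hypothetical simulation output: one array per characteristic,
# one entry per network (the real case has ~1200 arrays of ~150k entries).
network_names = ["net-A", "net-B"]
results = {"total_cost": [10.5, 12.25], "avg_delay": [0.8, 0.6]}

with conn:  # one transaction for the whole load
    for i, name in enumerate(network_names):
        network_id = conn.execute(
            "INSERT INTO tblNetworks (NetworkName) VALUES (?)", (name,)).lastrowid
        # The would-be field name becomes data in the CalcDesc column.
        conn.executemany(
            "INSERT INTO tblCalculations (NetworkID, CalcDesc, Result) "
            "VALUES (?, ?, ?)",
            [(network_id, desc, values[i]) for desc, values in results.items()])

n_rows = conn.execute("SELECT COUNT(*) FROM tblCalculations").fetchone()[0]
delay_b = conn.execute("""
    SELECT c.Result FROM tblCalculations c
    JOIN tblNetworks n ON n.NetworkID = c.NetworkID
    WHERE n.NetworkName = 'net-B' AND c.CalcDesc = 'avg_delay'
""").fetchone()[0]
```

Adding a new characteristic later means inserting more rows, not altering the table.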
If you're really unfamiliar with Access, I recommend learning how to create Tables, setting up Relationships, and Referential Integrity. If you don't know SQL, search for INSERT INTO. And if you haven't used ADO before in Excel, search for ADODB Connections and the Execute method.
Update
You can definitely get away with a CSV for this. Like you said, it's pretty low overhead. Whether a text file or a database is the right answer probably depends more on how you're going to use the data and how often.
If you're going to pull this into Excel a small number of times, do a few sorts or filters, maybe a pivot table, then any performance hit you get from a CSV isn't going to be that bad. And if you only need to deal with a subset of the data at a time, you can use ADO to read a text file and only pull in the data you want at that time, further mitigating the slowness of sorting and filtering 150k rows. Not to mention if you have a few gigs of RAM, 150k x 1,200 probably won't be bad at all.
If you find that the performance of a CSV stinks because your hardware isn't up to the task, you have to access it often, or you're doing a ton of different queries against the data, it may be to your benefit to use the database. If your fields are structured as you say, you may benefit from even more tables. You'd still have the network table and the calc table, but you'd also have Market, Slot, and Characteristic tables. Then your Calc table would look like:
CalcID
CalcDesc
NetworkID
MarketID
SlotID
CharacteristicID
Result
If you're looking up data many times and you need it quickly, you're not going to do better than a bunch of INNER JOINs on those tables and a WHERE clause that limits what you want.
But only you can decide if it's worth all the setup and overhead of using a database. Because of that, I would start down the CSV path until a reason to change presented itself. I would design my code so that switching from CSV to a database only touched a few procedures (for example by using class modules), so that the change didn't affect any already-tested business logic.
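One way to picture that last point, sketched in Python instead of VBA class modules: two interchangeable storage classes with the same `save` and `load` methods, and simulation code that only ever talks to those two methods. All class, method, and field names here are hypothetical.

```python
import csv
import io
import sqlite3

class CsvResultStore:
    """Keeps results as CSV text (a file on disk would work the same way)."""
    def __init__(self):
        self.buffer = io.StringIO()
        self.writer = csv.writer(self.buffer)

    def save(self, network, calc, result):
        self.writer.writerow([network, calc, result])

    def load(self, network):
        rows = csv.reader(io.StringIO(self.buffer.getvalue()))
        return {calc: float(result) for net, calc, result in rows if net == network}

class SqliteResultStore:
    """Same two methods, backed by a database table."""
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE results (network TEXT, calc TEXT, result REAL)")

    def save(self, network, calc, result):
        self.conn.execute(
            "INSERT INTO results VALUES (?, ?, ?)", (network, calc, result))

    def load(self, network):
        cur = self.conn.execute(
            "SELECT calc, result FROM results WHERE network = ?", (network,))
        return dict(cur.fetchall())

def run_simulation(store):
    """Business logic sees only save/load, so the backend can be swapped."""
    store.save("net-A", "total_cost", 10.5)
    store.save("net-A", "avg_delay", 0.8)
    store.save("net-B", "total_cost", 12.25)
    return store.load("net-A")

csv_result = run_simulation(CsvResultStore())
db_result = run_simulation(SqliteResultStore())
```

Switching backends is then a one-line change at the call site, and `run_simulation` itself never needs retesting.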
