Where should I calculate values? Backend, a view, or ETL? - database

I have a table with n columns (static values). Based on those columns, I need a calculated value. The calculation is complex, but not time-consuming. I have several backend endpoints that use the table together with the calculated value.
I wonder where to do the calculation:
(1) approach: backend
Disadvantage: the calculation is repeated in each endpoint => redundancy (though I could extract it into a shared method)
Disadvantage: slower (every time an endpoint is executed, it has to re-calculate the value)
Advantage: business logic stays in the backend
(2) approach: view
Disadvantage: view => business logic in the database
Disadvantage: slower (the value is re-calculated every time I query the view)
(3) approach: ETL
An ETL job provides the calculated value
Disadvantage: business logic in the ETL
Advantage: fast (the value is calculated only once)
I'm using a code-first approach with .NET Core.
I know there are developers who want to put as much as possible into the database. Microsoft, on the other hand, leans more towards Code First.
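For illustration, approach (2) might look like this minimal sketch (Python with sqlite3 purely to show the idea; the table, the columns and the formula are hypothetical placeholders, not the actual schema):

import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table holding the static values.
conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, a REAL, b REAL)")
conn.executemany("INSERT INTO source_table (a, b) VALUES (?, ?)", [(2.0, 3.0), (5.0, 4.0)])

# Approach (2): the calculation lives in a view; every endpoint just selects from it.
conn.execute("""
CREATE VIEW source_with_value AS
SELECT id, a, b, (a * b) + a AS calculated_value   -- placeholder for the real calculation
FROM source_table
""")

for row in conn.execute("SELECT id, calculated_value FROM source_with_value"):
    print(row)

With something like this, every endpoint reads calculated_value from the view instead of repeating the formula in code.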

Related

Populate the domain model from data layer, or query database direct?

Say I have a domain model where 3 objects interact, Reservation, Vehicle and Fleet. The Fleet has many Vehicles, and each Vehicle can have many Reservations. e.g.
Fleet -1--*- Vehicle -1--*- Reservation
If I want Fleet to have a method getMostPopularVehicle(), I could have it iterate each Vehicle and count the number of Reservations.
If I then want to introduce an ORM for persistence, should I (1) have getMostPopularVehicle() call a data layer method to populate the Fleet, Vehicles and Reservations before iterating it as before? Or should I (2) now just query the database directly to get the most popular vehicle in the data layer method?
My thinking is that (1) is correct, but a database query can be so efficient. Perhaps I am approaching this all wrong?
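For concreteness, option (1) in plain domain code might look something like this sketch (written in Python just to show the shape; the class and attribute names are made up):

class Vehicle:
    def __init__(self, name):
        self.name = name
        self.reservations = []   # would be populated by the data layer / ORM

class Fleet:
    def __init__(self, vehicles):
        self.vehicles = vehicles

    def get_most_popular_vehicle(self):
        # Option (1): iterate the in-memory domain objects and count reservations.
        return max(self.vehicles, key=lambda v: len(v.reservations), default=None)

fleet = Fleet([Vehicle("van"), Vehicle("truck")])
fleet.vehicles[1].reservations.extend(["r1", "r2"])
print(fleet.get_most_popular_vehicle().name)   # truck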
Both approaches are valid. It depends on what you want to achieve; if you can gain performance by issuing a query (HQL or JPQL or whatever your ORM supports) that still works against your domain model, it is perfectly legitimate to do so.
If these statistics were expected to be readily available, I'd probably go for one of two other options that would avoid executing a potentially heavy-duty query as often (if at all):
Retrieve the data by querying the database directly (avoiding the domain model) using CQRS. The returned view models can be cached to avoid subsequent users causing the query to execute again.
Create a separate reporting database. Key the values on fleet id and vehicle id and maybe increment the value by 5 for every reservation and then decrement EVERY value by 1 once a week, never going any lower than 0. This way, you can keep fairly representative rolling statistics, whilst being able to quickly query the table and find the highest value row for a given fleet.
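A rough sketch of that second option (Python with sqlite3; the table and column names are made up) could look like this:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE vehicle_popularity (
    fleet_id   INTEGER,
    vehicle_id INTEGER,
    score      INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (fleet_id, vehicle_id)
)
""")

def on_reservation(fleet_id, vehicle_id):
    # Bump the score by 5 whenever a reservation is made.
    conn.execute("INSERT OR IGNORE INTO vehicle_popularity (fleet_id, vehicle_id) VALUES (?, ?)",
                 (fleet_id, vehicle_id))
    conn.execute("UPDATE vehicle_popularity SET score = score + 5 "
                 "WHERE fleet_id = ? AND vehicle_id = ?", (fleet_id, vehicle_id))

def weekly_decay():
    # Run once a week: decrement every score by 1, never going below 0.
    conn.execute("UPDATE vehicle_popularity SET score = MAX(score - 1, 0)")

def most_popular_vehicle(fleet_id):
    row = conn.execute("SELECT vehicle_id FROM vehicle_popularity "
                       "WHERE fleet_id = ? ORDER BY score DESC LIMIT 1", (fleet_id,)).fetchone()
    return row[0] if row else None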

Database / Application Design - When to Denormalise?

I am working on an application which needs to store financial transactions for an account.
This data will then need to be queried in a number of ways: I'll need to list individual transactions and, for example, show monthly totals by category. I'll also need to show a monthly summary with opening/closing balances.
As I see it, I could approach this in the following ways:
From the point of view of database consistency and normalisation, this could be modelled as a simple list of transactions. Balances may then be calculated in the application by adding up every transaction from the beginning of time to the date of the balance you wish to display.
A slight variation on this would be to model the data in the same way, but calculate the balances in a stored procedure on the database server. I appreciate that this isn't hugely different from #1 - both of these approaches will perform more slowly as more data is added to the system.
End of month balances could be calculated and stored in a separate table (possibly updated by triggers). I don't really like this approach from a data consistency point of view, but it should scale better.
I can't really decide which way to go with this. Should I start with the 'purest' data model and only worry about performance when it becomes an issue? Should I assume performance will become a problem and plan for it from day one? Is there another option which I haven't thought of which would solve the issue better?
I would look at it like this: the calculations are going to take longer and longer, and the majority of the monthly numbers older than the previous 2-3 months will not be changing. This is a performance problem that has a 100% chance of happening, since financial data will grow every month. Therefore, looking for a solution in the design phase is NOT premature optimization; it is smart design.
I personally am in favor of only recalculating such totals when the underlying data changes, rather than every time they are queried. Yes, the totals should be maintained by triggers on the table, which adds a slight overhead to inserts and deletes, but it makes the select queries much faster. In my experience, users tend to be more tolerant of a slightly longer action query than of a much longer select query. Overall this is a better design for this kind of data than a purely normalized model, as long as you get the triggers right. In the long run, only recalculating the numbers that have changed will take up far fewer server resources.
This model will maintain data integrity as long as all transactions go through the triggers. The biggest culprit there is usually data imports, which often bypass triggers. If you do those kinds of imports, make sure they have code that mimics the trigger code. Also make sure the triggers cover insert, update and delete, and that they are tested using multiple-record transactions, not just against single records.
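As a rough sketch of the trigger idea (shown with Python and sqlite3 and made-up table names; a real solution would also need the matching update and delete triggers mentioned above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        account_id INTEGER,
        tx_date    TEXT,      -- ISO date, e.g. '2024-03-15'
        amount     NUMERIC
    );

    CREATE TABLE monthly_balances (
        account_id INTEGER,
        month      TEXT,      -- e.g. '2024-03'
        balance    NUMERIC NOT NULL DEFAULT 0,
        PRIMARY KEY (account_id, month)
    );

    -- Keep the running monthly total up to date on insert.
    CREATE TRIGGER trg_tx_insert AFTER INSERT ON transactions
    BEGIN
        INSERT OR IGNORE INTO monthly_balances (account_id, month)
        VALUES (NEW.account_id, substr(NEW.tx_date, 1, 7));

        UPDATE monthly_balances
        SET balance = balance + NEW.amount
        WHERE account_id = NEW.account_id AND month = substr(NEW.tx_date, 1, 7);
    END;
""")

conn.execute("INSERT INTO transactions VALUES (1, '2024-03-15', 100)")
conn.execute("INSERT INTO transactions VALUES (1, '2024-03-20', -40)")
print(conn.execute("SELECT * FROM monthly_balances").fetchall())   # [(1, '2024-03', 60)]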
The other model is to create a data warehouse that is populated on a schedule, such as nightly. This is fine if the data can be slightly out of date. If the majority of queries against this consolidated data are for reporting and don't involve the current month/day very much, then this will work well, and you can do it in an SSIS package.

Should I normalize an entity to a one-to-one relationship model in GAE?

I have a student entity which already has about 12 fields. Now I want to add 12 more fields (all related to the student's academic details). Should I normalize (as one-to-one) and store them in a different entity, or should I keep adding the information to the Student entity only?
I am using gaesession to store the logged-in user in memory:
session = get_current_session()
session['user'] = user
Will this affect the read and write performance/cost of the app? Is the cost of storing an entity in memcache (FE instance) related to the number of attributes stored in the entity?
Generally the costs of either writing two entities or fetching two entities will be greater than the cost of writing or fetching a single entity.
Write costs are associated with the number of indexed fields. If you're adding indexed fields, that would increase the write cost whenever those fields are modified. If an indexed field is not modified and the index doesn't need to be updated, you do not incur the cost of updating that index. You're also not charged for the size of the entity, so from a cost perspective, sticking with a single entity will be cheaper.
Performance is a bit more complicated. Performance will be affected by 1) query overhead and 2) the size of the entities you are fetching.
If you have two entities, you're going to suffer double the query overhead, since you'll likely have to query/fetch the base student entity and then issue a second query/fetch for the second entity. There may be ways around this if you are able to fetch both entities by id asynchronously. If you need to query, though, your perf is likely going to suffer whenever you need to query for the second entity.
On the flip side, perf scales negatively with entity size. Fetching 100 1 MB entities will take significantly longer than fetching 100 500-byte entities. If your extra data is large, and you typically query for many student entities at once, then storing the extra data in a separate entity, so that the basic student entity stays small, can increase performance significantly for the cases where you don't need the second entity.
Overall, for performance, you should consider your data access patterns and try to minimize extraneous data fetching in the common case. That is, if you tend to fetch only one student at a time, and you almost always need all of the data for that student, then loading all the data won't affect your cost.
However, if you generally pull lists of many students, and rarely use the full data for a single student, and the data is large, you may want to split the entities.
Also, that comment by @CarterMaslan is wrong. You can support transactional updates. It will actually be more complicated to keep the data synchronized if you have parts of it in separate entities; in that case you'll need to make sure the two entities share a common ancestor in order to perform a transactional operation.
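If you do split the data, a minimal sketch of that common-ancestor arrangement with the legacy Python ndb API (model and property names are made up) might look like:

from google.appengine.ext import ndb

class Student(ndb.Model):
    name = ndb.StringProperty()

class AcademicDetails(ndb.Model):
    # Stored as a child of the Student key, so both live in one entity group.
    gpa = ndb.FloatProperty()

@ndb.transactional
def update_both(student_key, new_name, new_gpa):
    # Same entity group (common ancestor), so both writes fit in one transaction.
    student = student_key.get()
    details = AcademicDetails.query(ancestor=student_key).get()
    student.name = new_name
    details.gpa = new_gpa
    ndb.put_multi([student, details])

# Creating the pair: the details entity takes the student's key as its parent.
student_key = Student(name='Alice').put()
AcademicDetails(parent=student_key, gpa=3.7).put()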
It depends on how often these two "sets" of data need to be retrieved from the datastore. As a general principle in GAE you should de-normalize your data, so in your case store all properties in the same model. This will result in more write operations when you store an entity, but will reduce the get and query operations.
Memcache is not billable, so you don't have to worry about memcache costs. Also, if you use ndb (and I recommend you do so), caching in memcache is handled automatically.

Does storing aggregated data go against database normalization?

On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If measurement shows a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance, so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) Using DB2, I used an MQT (Materialized Query Table), which works like a view except that it is driven by a query and you can schedule how often you want it to refresh, e.g. every 5 minutes. That table then stored the count values.
2) In the software package itself I set information like that as a system variable. So in Apache you can set a system-wide variable and refresh it every 5 minutes. Then it's somewhat accurate, but you're only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it, so it's been a while, but I think in PHP it was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value, it saves you from re-running the query with every request.
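The same pattern sketched in Python (run_count_query is a hypothetical placeholder for the expensive count query):

import time

CACHE_TTL = 300  # refresh at most every 5 minutes
_cache = {'expires': 0, 'count': None}

def run_count_query():
    # Placeholder for the real, potentially expensive COUNT(*) query.
    return 15

def get_report_page_count():
    now = time.time()
    if now >= _cache['expires']:
        _cache['count'] = run_count_query()
        _cache['expires'] = now + CACHE_TTL
    return _cache['count']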

Where should I handle concurrent operations on the same data? In the application or the database?

So I am building a Java webapp with Spring and Hibernate. In the application, users can add points to an object, and I'd like to count the points given so that I can order my objects by them. The objects are also stored in the database, and hopefully hundreds of people will give points to the objects at the same time.
But how do I count the points and save them in the database at the same time? Usually I would just have a property on my object and increase the points. But that would mean that I have to lock the data in the database with a pessimistic transaction in order to prevent concurrency issues (reading the number of points while another thread is halfway through changing it). That would possibly make my app much slower (at least I imagine it would).
The other solution would be to store the amount of given points in an associated object and store them separately in the database while counting the points in memory within a "small" synchronized block or something.
Which solution has the least performance impact when handling many concurrent operations on the same objects? Or are there other fitting solutions?
If you would like the values to be persisted, then you should persist them in your database.
Given that, try the following:
create a very narrow row, like just OBJ_ID and POINTS.
create an index only on OBJ_ID, so not a lot of time is spent updating indexes when values are inserted, updated or deleted.
use InnoDB, which has row-level locking, so the locks will be smaller
MySQL will give you the last committed (consistent) value
That's all pretty simple. Give it a whirl! Set up a test case that mimics your expected load and see how it performs. Post back if you get stuck.
Good luck.
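A minimal sketch of that layout together with an atomic increment (shown with sqlite3 rather than MySQL/InnoDB, just to illustrate the shape; table and column names are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
# Narrow row: just the object id and its points, indexed only by the primary key.
conn.execute("CREATE TABLE object_points (obj_id INTEGER PRIMARY KEY, points INTEGER NOT NULL DEFAULT 0)")
conn.execute("INSERT INTO object_points (obj_id) VALUES (42)")

def add_point(obj_id):
    # Single atomic UPDATE: the database serializes concurrent increments,
    # so the application never does a read-modify-write of the points value.
    conn.execute("UPDATE object_points SET points = points + 1 WHERE obj_id = ?", (obj_id,))
    conn.commit()

add_point(42)
print(conn.execute("SELECT points FROM object_points WHERE obj_id = 42").fetchone()[0])   # 1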
