Analysing the performance of forecasting models - forecasting

I am just getting into studying forecasting methods and I want to figure out how performance is commonly measured. My instinct is that out-of-sample performance is most important (you want to see how well your model does with unseen data). I have also noticed that forecast performance does not do well if your out-of-sample data is too large (which makes sense the farther you go in the future, the less likely your model will perform well). So I was wondering how to determine the best size of out-of-sample data to test on?

I think you are confusing the forecasting horizon with the out-of-sample data to test on the forecasting performance, when you say " I have also noticed that forecast performance does not do well if your out-of-sample data is too large".
When you do forecasting, you are usually interested in a certain forecasting horizon. For example, if you have a time series at monthly frequency, you might be interested in one month horizon (short-term forecast) or 12 months (long-term forecasting). So the forecasting performance usually deteriorates with longer forecasting horizons, not with more out-of-sample data.
It is hard to suggest the number of observations on which you test your model because it depends on how you want to evaluate the forecast. If you want to use some formal statistical tests, then you need more observations, but if you are interested in predicting a certain event and you are just interest in the performance of a single model, then you are fine with a relatively low number of out-of-sample observations.
Hope this helps,
Paolo

Related

which one is better to calculate star rating average (in performance sense)

i thinking which one is the best way to show average star rating. is it better to calculate the average when there's a review with star value given, and store the average in DB field, so when i load a page i just check 1 field's value? or calculate the average each time user loading the page?
Without a sample schema, and an idea of typical usage, it's almost impossible to provide a good answer.
The question you pose is "should I denormalize my database" - there are lots of other questions on this topic.
From a performance point of view, the question boils down to "how often do you have to write, how often do you have to read, and how important is it that data is consistent?".
If your application user experience is such that "star ratings" are shown almost never, and calculating that star rating is "cheap", then the performance impact is low.
If you are showing long, scrolling pages with items, each with a star rating, the performance benefit could be high, especially if calculating the star rating is an expensive operation.
If it's important that star ratings are exactly accurate in all cases, you will have to add some additional logic like locking behaviour which could have a huge impact on your database.
If your application experience means that you may have periods with very high numbers of new ratings, you could have a significant performance impact on the "write" operation.
In general, it's best to design your application to be normalized (so it's easy to debug and maintain), and to measure whether you need to do anything more. Modern database engines can handle far more than most people realize.
** update **
Thanks for your update.
Your schema suggestion should be lightning fast without denormalization - you should be joining on a foreign key on the reviews table. It all depends on the exact circumstances, but unless you need to scale to hundreds of millions of products and reviews, I doubt you'd ever see a measurable difference in database performance. The logic to keep the "average score" column updated may be more of a performance overhead than calculating it on the fly.
In my experience, denormalization is an expensive thing to do - it makes your code much harder to understand and debug, and leads to entertaining bugs. From a performance point of view, if you're building a website, you'll get a much better return by focusing on caching at the HTTP level.

Database System for timeseries & aggregation

I'm currently creating a raspberry pi based logging device for logging the power which is fed into the grid by a solar array.
The "main table" will be growing at ~ 20 entries representing the "current" power produced by several parts of the array.
Basically this isn't that much and can be handled at an acceptable performance using a raspberry pi, but with a growing amount of data queries like "select last 10 years, group by month" probably wouldn't be very effective... (the data should be displayed via an interactive web interface)
I thought of doing some "background aggregation" and maintaining several tables for containing the aggregated data of various timeframes, but this seems like a problem which probably has been dealt with by many people before.
What do you suggest me to do?
You do not know how much data growth is needed to affect performance.
You do not know by how much performance will be affected then.
You do not know if performance will be affected at all.
As long as you do not have even an estimate of how much performance improvement you need, it does not make sense to try to do optimizations.
Or, as said by Donald Knuth:
premature optimization is the root of all evil
If you really do want to create caches of aggregated values, I'd suggest to use triggers to keep the cache consistent after any change to the original data.

2 Questions about Philosophy and Best Practices in Database Development

Which one is best, regarding the implementation of a database for a web application: a lean and very small database with only the bare information, sided with a application that "recalculates" all the secondary information, on demand, based on those basic ones, OR, a database filled with all those secondary information already previously calculated, but possibly outdated?
Obviously, there is a trade-of there and I think that anyone would say that the best answer to this question is: "depends" or "is a mix between the two". But I'm really not to comfortable or experienced enough to reason alone about this subject. Could someone share some thoughts?
Also, another different question:
Should a database be the "snapshot" of a particular moment in time or should a database accumulate all the information from previous time, allowing the retrace of what happened? For instance, let's say that I'm modeling a Bank Account. Should I only keep the one's balance on that day, or should I keep all the one's transactions, and from those transactions infer the balance?
Any pointer on this kind of stuff that is, somehow, more deep in database design?
Thanks
My quick answer would be to store everything in the database. The cost of storage is far lower than the cost of processing when talking about very large scale applications. On small scale applications, the data would be far less, so storage would still be an appropriate solution.
Most RDMSes are extremely good at handling vast amounts of data, so when there are millions/trillions of records, the data can still be extracted relatively quickly, which can't be said about processing the data manually each time.
If you choose to calculate data rather than store it, the processing time doesn't increase at the same rate as the size of data does - the more data ~ the more users. This would generally mean that processing times would multiply by the data's size and the number of users.
processing_time = data_size * num_users
To answer your other question, I think it would be best practice to introduce a "snapshot" of a particular moment only when data amounts to such a high value that processing time will be significant.
When calculating large sums, such as bank balances, it would be good practice to store the result of any heavy calculations, along with their date stamp, to the database. This would simply mean that they will not need calculating again until it becomes out of date.
There is no reason to ever have out of date pre-calulated values. That's what trigger are for (among other things). However for most applications, I would not start precalculating until you need to. It may be that the calculation speed is always there. Now in a banking application, where you need to pre-calculate from thousands or even millions of records almost immediately, yes, design a precalulation process bases on triggers that adjust the values every time they are changed.
As to whether to store just a picture in time or historical values, that depends largely on what you are storing. If it has anything to do with financial data, store the history. You will need it when you are audited. Incidentally, design to store some data as of the date of the action (this is not denormalization). For instance, you have an order, do not rely onthe customer address table or the product table to get data about where the prodcts were shipped to or what they cost at the time of the order. This data changes over time and then you orders are no longer accurate. You don't want your financial reports to change the dollar amount sold because the price changed 6 months later.
There are other things that may not need to be stored historically. In most applications we don't need to know that you were Judy Jones 2 years ago and are Judy Smith now (HR application are usually an exception).
I'd say start off just tracking the data you need and perform the calculations on the fly, but throughout the design process and well into the test/production of the software keep in mind that you may have to switch to storing the pre-calculated values at some point. Design with the ability to move to that model if the need arises.
Adding the pre-calculated values is one of those things that sounds good (because in many cases it is good) but might not be needed. Keep the design as simple as it needs to be. If performance becomes an issue in doing the calculations on the fly, then you can add fields to the database to store the calculations and run a batch overnight to catch up and fill in the legacy data.
As for the banking metaphor, definitely store a complete record of all transactions. Store any data that's relevant. A database should be a store of data, past and present. Audit trails, etc. The "current state" can either be calculated on the fly or it can be maintained in a flat table and re-calculated during writes to other tables (triggers are good for that sort of thing) if performance demands it.
It depends :) Persisting derived data in the database can be useful because it enables you to implement constraints and other logic against it. Also it can be indexed or you may be able to put the calculations in a view. In any case, try to stick to Boyce-Codd / 5th Normal Form as a guide for your database design. Contrary to what you may sometimes hear, normalization does not mean you cannot store derived data - it just means data shouldn't be derived from nonkey attributes in the same table.
Fundamentally any database is a record of the known facts at a particular point in time. Most databases include some time component and some data is preserved whereas some is not - requirements should dictate this.
You've answered your own question.
Any choices that you make depend on the requirements of the application.
Sometimes speed wins, sometimes space wins. Sometime data accuracy wins, sometimes snapshots win.
While you may not have the ability to tell what's important, the person you're solving the problem for should be able to answer that for you.
I like dynamic programming(not calculate anything twise). If you're not limited with space and are fine with a bit outdated data, then precalculate it and store in the DB. This will give you additional benefit of being able to run sanity checks and ensure that data is always consistent.
But as others already replied, it depends :)

Efficient dataset size for a feed-foward neural network training

I'm using a feed-foward neural network in python using the pybrain implementation. For the training, i'll be using the back-propagation algorithm. I know that with the neural-networks, we need to have just the right amount of data in order not to under/over-train the network. I could get about 1200 different templates of training data for the datasets.
So here's the question:
How do I calculate the optimal amount of data for my training? Since I've tried with 500 items in the dataset and it took many hours to converge, I would prefer not to have to try too much sizes. The results we're quite good with this last size but I would like to find the optimal amount. The neural network has about 7 inputs, 3 hidden nodes and one output.
How do I calculate the optimal amount
of data for my training?
It's completely solution-dependent. There's also a bit of art with the science. The only way to know if you're into overfitting territory is to be regularly testing your network against a set of validation data (that is data you do not train with). When performance on that set of data begins to drop, you've probably trained too far -- roll back to the last iteration.
The results were quite good with this
last size but I would like to find the
optimal amount.
"Optimal" isn't necessarily possible; it also depends on your definition. What you're generally looking for is a high degree of confidence that a given set of weights will perform "well" on unseen data. That's the idea behind a validation set.
The diversity of the dataset is much more important than the quantity of samples you are feeding to the network.
You should customize your dataset to include and reinforce the data you want the network to learn.
After you have crafted this custom dataset you have to start playing with the amount of samples, as it is completely dependant on your problem.
For example: If you are building a neural network to detect the peaks of a particular signal, it would be completely useless to train your network with a zillion samples of signals that do not have peaks. There lies the importance of customizing your training dataset no matter how many samples you have.
Technically speaking, in the general case, and assuming all examples are correct, then more examples are always better. The question really is, what is the marginal improvement (first derivative of answer quality)?
You can test this by training it with 10 examples, checking quality (say 95%), then 20, and so on, to get a table like:
10 95%
20 96%
30 96.5%
40 96.55%
50 96.56%
you can then clearly see your marginal gains, and make your decision accordingly.

Computed Values

Like most people, I work on a data-driven object-oriented business application. I use a relational database to store my data.
I have designed my application so far that I never store computed values. That is, if a user would like to consult the output of a "simulation" he ran last year, my application would simply recompute the report from existing historical data, instead of reading the result of the simulation. Since the report takes very little time to create - it's mostly simple arithmetic - can I safely assume I can get by without storing the result of their reports? I'm having a hard time figuring out a future business requirement which would make me regret not having stored the results in the first place.
Not storing the result of computations reduces redundancy and is generally, in my opinion, a good thing - but it comes down to a case of normalization vs. computational power required for an operation. Often, databases don't get 100% normalized because we don't live in the ideal world where that would be ideal (infinite CPU power and I/O speed), and so it should really be determined on a case-by-case basis.
If you can't foresee a need for storing the results of a computation in the DB, I'd suggest you don't store it. A DB is generally easier to maintain, the more normalized it is.
Tax changes, any other regulation that apply to All the data.
You can get around it by using values at the time, but when it involves any calculation that's changed complexity starts to build up.
If you have all the data you and can compute simulations based on that data then you should be fine. If in the future you find that these simulations are taking too long to run then you can begin storing the computed values and simply change your application to pull historical values from there.

Resources