Weekly data prediction with Neural Prophet - forecasting

Hi, there is something that is unclear to me after reading the articles and documentation.
They say (https://github.com/facebook/prophet/issues/2112) that the underlying model is continuous.
This is hard for me to understand, since their X matrix is normally built from the observed data points at times t, no?
For example, since my data are sampled weekly, the Nyquist argument says I should not use components with a frequency above roughly twice per month, so a weekly seasonality should not be identifiable from weekly samples.
But under the hood, do they interpolate the time series to daily points?
In an experiment with weekly data I forced the weekly seasonality to 8 coefficients, and it drastically improved performance on an unseen validation set (from 0.3 R2 without weekly seasonality to 0.6 with it), which doesn't make sense to me.
I of course predict on weekly data too.
So I am not sure whether they first interpolate to daily resolution and then perform the seasonality computations.
Can someone help me understand, please?
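From what I can tell, the seasonality block in Prophet-style models is a Fourier series evaluated directly at the observed timestamps, which is presumably what "continuous" refers to. Here is a rough sketch of what I mean by the X matrix (my own illustration in plain pandas/numpy, not NeuralProphet's actual code; fourier_features, period_days and order are made-up names):

import numpy as np
import pandas as pd

def fourier_features(ts, period_days, order):
    # ts: DatetimeIndex of the observed samples (weekly here).
    # t is real-valued time in days since the first sample, so the features
    # can be evaluated at any timestamp without interpolating to daily points.
    t = (ts - ts[0]) / pd.Timedelta(days=1)
    k = np.arange(1, order + 1)
    angles = 2.0 * np.pi * np.outer(t, k) / period_days
    return np.hstack([np.sin(angles), np.cos(angles)])  # 2 * order columns

weekly_index = pd.date_range("2022-01-02", periods=104, freq="W")
X_weekly = fourier_features(weekly_index, period_days=7.0, order=4)  # 8 coefficients

(At exact 7-day spacing these columns are constant, the sin terms 0 and the cos terms 1, which is exactly why it puzzles me that adding them changes anything.)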

Related

Is QuestDB suitable for querying interval data, such as energy meters

I want to design a database to store IoT data from several utility meters (electricity, gas, water) delivered over the MQTT protocol.
Is QuestDB suitable for this type of data, where it would store meter readings (it is the difference between readings that is mainly of interest, as opposed to the readings themselves)? More specifically, would the database allow me to quickly and easily run the following queries? If so, some example queries would be helpful.
calculate energy consumption of a specific/all meters in a given date period (essentially taking the difference between readings between two dates for example)
calculate the rate of energy consumption with time over a specific period for trending purposes
Also, could it cope with situations where a faulty meter is replaced and therefore the reading is reset to 0 for example, but consumption queries should not be affected?
My initial ideas:
Create a table with Timestamp, MeterID, UtilityType, MeterReading, ConsumptionSinceLastReading fields
When entering a new meter reading record, it would calculate the consumption since the last reading and store it in the relevant table field. This doesn't seem like quite the right approach, though; perhaps a time series DB like QuestDB has a built-in solution for this kind of problem?
I think you have approached the problem the right way.
By storing both the actual reading and the difference to the previous reading you have everything you need.
You could calculate the energy consumption and the rate with SQL statements.
Consumption for a period:
select sum(ConsumptionSinceLastReading) from Readings
where ts>'2022-01-15T00:00:00.000000Z' and ts<'2022-01-19T00:00:00.000000Z';
Rate (consumption/sec) within a period:
select (max(MeterReading)-min(MeterReading)) / (cast(max(ts)-min(ts) as long)/1000000.0) from (
select MeterReading, ts from Readings
where ts>'2022-01-15T00:00:00.000000Z' and ts<'2022-01-20T00:00:00.000000Z' limit 1
union
select MeterReading, ts from Readings
where ts>'2022-01-15T00:00:00.000000Z' and ts<'2022-01-20T00:00:00.000000Z' limit -1
);
There is no specific built-in support for your use case in QuestDB, but it would do a good job. The selects in the rate query are efficient since they use the LIMIT keyword to fetch only the first and last rows of the period.
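Regarding the faulty-meter replacement, one option is to handle it where ConsumptionSinceLastReading is computed at ingestion time. A rough application-side sketch (this is an assumption on my part, not a QuestDB feature; the reset rule treats a drop in the reading as a meter that restarted from 0):

# Illustrative ingestion-side calculation of ConsumptionSinceLastReading.
last_reading = {}  # (meter_id, utility_type) -> previous MeterReading

def consumption_since_last(meter_id, utility_type, new_reading):
    key = (meter_id, utility_type)
    prev = last_reading.get(key)
    if prev is None:
        delta = 0.0                 # first ever reading: nothing to diff against
    elif new_reading >= prev:
        delta = new_reading - prev  # normal monotonically increasing case
    else:
        delta = new_reading         # assumed: meter replaced and restarted from 0
    last_reading[key] = new_reading
    return delta

With the delta stored per row, the consumption query above keeps working across a meter swap.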

How to generate hourly weather data for 8760 (Entire Year) using PvLib Python

Instead of reading a TMY file into PvLib, I want to generate weather data using a PvLib function, class, or module.
I have found some functions to generate weather forecasts using these modules: "from pvlib.forecast import GFS, NAM, NDFD, HRRR, RAP".
The methods mentioned above have some limitations. They generate data only for a limited period; some of the modules cover only 7 days or 1 month.
They also return data at a 3-hourly timestamp interval.
Is there any way to interpolate weather data for the entire year using PvLib?
Forecasts are generally meant for predicting the future, and they are limited in both time span and accuracy: the further into the future you forecast, the less accurate it becomes. For example, the forecast for today is more accurate than the forecast for tomorrow, and so on. That is why the forecast is limited to something like seven future days.
Forecast providers such as GFS may or may not provide historical forecast data; it depends on the provider and their services.
As I remember, GFS delivers its predictions as files in an old-fashioned way, so I moved to providers that offer forecasts through online REST services, since I am first a programmer, then a data scientist, and never a meteorologist.
When the time series is not at your required resolution, you can resample it. The extra values are calculated mathematically by an interpolation formula, and as long as you don't know the original provider's formula, the resampled values will likely differ from what the provider would produce.
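As a rough illustration of that resampling step (plain pandas rather than a pvlib API; the frame and its column names are made up), upsampling a 3-hourly series to hourly with interpolation could look like this:

import numpy as np
import pandas as pd

# Fake 3-hourly weather frame standing in for a downloaded forecast;
# 'temp_air' and 'wind_speed' are illustrative column names.
idx_3h = pd.date_range("2022-01-01", periods=8, freq="3h", tz="UTC")
weather_3h = pd.DataFrame(
    {"temp_air": np.linspace(2.0, 9.0, 8), "wind_speed": np.linspace(3.0, 5.0, 8)},
    index=idx_3h,
)

# Upsample to hourly and fill the new rows by interpolating in time.
weather_1h = weather_3h.resample("1h").interpolate(method="time")

Keep in mind that linear interpolation is a crude approximation for quantities such as irradiance, which change nonlinearly around sunrise and sunset.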

How to deal with a dataset with "periods of time" and missing data

I'm working on a dataset which has points in time as columns (e.g. August, September, etc.) and, as rows, different measurements that were collected at those points.
Apart from that, the data is not clean at all: there is a lot of missing data, and I can't simply drop all the rows containing it or fill it in, so my idea was to divide the dataset into 4 smaller ones.
What kind of analysis can be performed on a dataset of this kind? Should I transpose the rows and columns?
Time series regression with missing data is a special case within statistical analysis. Simply re-jigging the dataset is not the solution.
As I understand it, periodicity analysis and spectral analysis are performed to identify the sinusoid of best fit, i.e. a sine wave is driven through the data despite the missing points, and regression is one approach to identifying the fit to the existing data.
The same question has previously been raised on Stats Exchange in the context of ARIMA (autoregressive integrated moving average). Personally, I am not overawed by that approach, because there will be a specialist solution.
https://stats.stackexchange.com/questions/121414/how-do-i-handle-nonexistent-or-missing-data
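As a rough illustration only (my own sketch, not from the linked thread), a least-squares fit of a single sinusoid with an assumed 12-month period, simply ignoring the missing observations, could look like this:

import numpy as np

# Toy monthly series with gaps; t is the month index, y the measurements (NaN = missing).
rng = np.random.default_rng(0)
t = np.arange(24, dtype=float)
y = 5.0 + 2.0 * np.sin(2 * np.pi * t / 12.0) + rng.normal(0, 0.3, t.size)
y[[3, 7, 8, 15, 20]] = np.nan

period = 12.0        # assumed annual cycle for monthly data
mask = ~np.isnan(y)  # fit only where observations exist

# Linear least squares for y ~ a*sin + b*cos + c with a fixed, known period.
X = np.column_stack([np.sin(2 * np.pi * t / period),
                     np.cos(2 * np.pi * t / period),
                     np.ones_like(t)])
coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)

y_filled = np.where(mask, y, X @ coef)  # fitted sinusoid fills the gaps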

Analysing the performance of forecasting models

I am just getting into studying forecasting methods and I want to figure out how performance is commonly measured. My instinct is that out-of-sample performance is most important (you want to see how well your model does on unseen data). I have also noticed that forecast performance does not hold up if your out-of-sample data is too large (which makes sense: the farther you go into the future, the less likely your model is to perform well). So I was wondering: how do I determine the best size of out-of-sample data to test on?
I think you are confusing the forecasting horizon with the out-of-sample data used to test forecasting performance when you say "I have also noticed that forecast performance does not do well if your out-of-sample data is too large".
When you do forecasting, you are usually interested in a certain forecasting horizon. For example, if you have a time series at monthly frequency, you might be interested in a one-month horizon (short-term forecasting) or a 12-month horizon (long-term forecasting). So the forecasting performance usually deteriorates with longer forecasting horizons, not with more out-of-sample data.
It is hard to suggest the number of observations on which to test your model because it depends on how you want to evaluate the forecast. If you want to use formal statistical tests, you need more observations; but if you are interested in predicting a certain event and you only care about the performance of a single model, then you are fine with a relatively low number of out-of-sample observations.
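To make the distinction concrete, here is a rough rolling-origin backtest sketch (the toy series and the naive forecast are placeholders, not a prescription). The horizon and the number of out-of-sample origins are two separate choices, and the error typically grows with the horizon, not with the number of origins:

import numpy as np
import pandas as pd

# Toy monthly series; replace with your own data.
rng = np.random.default_rng(1)
y = pd.Series(np.sin(np.arange(120) * 2 * np.pi / 12) + rng.normal(0, 0.2, 120))

horizon = 12    # how far ahead each forecast looks (the forecasting horizon)
n_origins = 24  # how many out-of-sample forecast origins to evaluate on

errors = {h: [] for h in range(1, horizon + 1)}
for origin in range(len(y) - horizon - n_origins + 1, len(y) - horizon + 1):
    train = y.iloc[:origin]
    forecast = np.repeat(train.iloc[-1], horizon)  # naive: repeat the last value
    actual = y.iloc[origin:origin + horizon].to_numpy()
    for h in range(1, horizon + 1):
        errors[h].append(abs(actual[h - 1] - forecast[h - 1]))

# Mean absolute error per horizon step.
mae_by_horizon = {h: float(np.mean(e)) for h, e in errors.items()}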
Hope this helps,
Paolo

Why does my cube compute so slowly at the lowest drill down level?

I'm still learning the ropes of OLAP, cubes, and SSAS, but I'm hitting a performance barrier and I'm not sure I understand what is happening.
So I have a simple cube, which defines two simple dimensions (type and area), a third Time dimension hierarchy (Year -> Quarter -> Month -> Day -> Hour -> 10-Minute), and one measure (a sum on a field called Count). The database tracks events: when they occur, what type they are, and where they occurred. The fact table is a precalculated summary of events for each 10-minute interval.
So I set up my cube and I use the browser to view all my attributes at once: total counts per area per type over time, with drill down from Year down to the 10 Minute Interval. Reports are similar in performance to the browse.
For the most part, it's snappy enough. But as I get deeper into the drill-tree, it takes longer to view each level. Finally at the minute level it seems to take 20 minutes or so before it displays the mere 6 records. But then I realized that I could view the other minute-level drilldowns with no waiting, so it seems like the cube is calculating the entire table at that point, which is why it takes so long.
I don't understand. I would expect that going to Quarters or Years would take longest, since it has to aggregate all the data up. Going to the lowest metric, filtered down heavily to around 180 cells (6 intervals, 10 types, 3 areas), seems like it should be fastest. Why is the cube processing the entire dataset instead of just the visible sub-set? Why is the highest level of aggregation so fast and the lowest level so slow?
Most importantly, is there anything I can do by configuration or design to improve it?
Some additional details that I just thought of which may matter: This is SSAS 2005, running on SQL Server 2005, using Visual Studio 2005 for BI design. The Cube is set (as by default) to full MOLAP, but is not partitioned. The fact table has 1,838,304 rows, so this isn't a crazy enterprise database, but it's no simple test db either. There's no partitioning and all the SQL stuff runs on one server, which I access remotely from my work station.
When you are looking at the minute level - are you talking about all events from 12:00 to 12:10 regardless of day?
I would think if you need that to go faster (because obviously it would be scanning everything), you will need to make the two parts of your "time" dimension orthogonal - make a date dimension and a time dimension.
If you are getting 1/1/1900 12:00 to 1/1/1900 12:10, I'm not sure what it could be then...
Did you verify the aggregations of your cube to ensure they are correct? An easy way to tell is if you get the same number of records no matter which drill-tree you go down.
Assuming that is not the case, what Cade suggests about making a Date dimension AND a Time dimension would be the most obvious approach; keeping date and time together in a single dimension is one of the bigger no-no's in SSAS. See this article for more information: http://www.sqlservercentral.com/articles/T-SQL/70167/
Hope this helps.
I would also check that you are running the latest service pack for SQL Server 2005.
The RTM version had some SSAS performance issues.
Also check that you have correctly defined attribute relationships on your time dimension, and on the other dimensions as well.
Not having these relationships defined will cause the SSAS storage engine to scan more data than necessary.
More info: http://ms-olap.blogspot.com/2008/10/attribute-relationship-example.html
As stated above, splitting out the date and time will significantly decrease the cardinality of your date dimension, which should increase performance and allow a better analytic experience.
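As a back-of-the-envelope illustration of that cardinality point (the five-year span below is an assumption, not a figure from the question):

# Rough member counts for a combined datetime attribute at 10-minute grain
# versus separate Date and Time dimensions, assuming a five-year span.
years = 5
days = years * 365
slots_per_day = 24 * 6                   # 10-minute intervals per day

combined_members = days * slots_per_day  # single attribute: 262,800 members
split_members = days + slots_per_day     # Date (1,825) + Time (144) = 1,969 members

print(combined_members, split_members)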
