I have been using SARIMAX for time series forecasting and getting decent results. I am currently exploring TensorFlow Probability and came across tfp.sts.forecast.
About the data: I have a sample time series with hourly observations. The data can have hour-of-day as well as day-of-week seasonality.
For building the model, following the samples, I tried setting up the effects as below:
hour_of_day_effect = tfp.sts.Seasonal(
    num_seasons=24,
    num_steps_per_season=1,
    observed_time_series=observed_time_series,
    allow_drift=True,
    name='hour_of_day_effect')
day_of_week_effect = tfp.sts.Seasonal(
    num_seasons=168,
    num_steps_per_season=1,
    observed_time_series=observed_time_series,
    allow_drift=True,
    name='day_of_week_effect')
autoregressive = tfp.sts.Autoregressive(
    order=1,
    observed_time_series=observed_time_series,
    name='autoregressive')
model = tfp.sts.Sum([hour_of_day_effect,
                     day_of_week_effect,
                     autoregressive],
                    observed_time_series=observed_time_series)
My question is about day_of_week_effect. For my hourly data, if I set it up as
num_seasons=7
num_steps_per_season=24
I do not get good results.
For example, if there is a spike every Friday, it is not captured correctly with the values set as above.
But if I set them up as
num_seasons=168,
num_steps_per_season=1,
it is captured correctly. (I arrived at 168 as 24 * 7)
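For reference, the two day_of_week_effect variants I compared look roughly like this (a sketch only, reusing tfp and observed_time_series from the snippet above):

# Variant A: 7 seasons of 24 hourly steps each -- does not capture the Friday spike
day_of_week_effect_a = tfp.sts.Seasonal(
    num_seasons=7,
    num_steps_per_season=24,
    observed_time_series=observed_time_series,
    allow_drift=True,
    name='day_of_week_effect')

# Variant B: 168 seasons of 1 step each (24 * 7) -- captures the spike correctly
day_of_week_effect_b = tfp.sts.Seasonal(
    num_seasons=168,
    num_steps_per_season=1,
    observed_time_series=observed_time_series,
    allow_drift=True,
    name='day_of_week_effect')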
Could you please explain why these two configurations behave so differently?
I have daily sales data from 2013-02-18 to 2017-02-12 with only 4 days of data missing (Christmas Day, 25 December, each year). These holidays have a sales volume of zero. My purpose is to work out how to staff my store for the upcoming week by producing a short-term forecast of my sales for the next 5-7 days.
I start by setting up this data as a time series:
ts <- ts(mydata, frequency = 365)
and then run an initial analysis through a decomposition.
This seems to show that I have a declining sales trend, but also some seasonality, if I'm not mistaken. So, to start my forecast implementation, I fit an ARIMA model to the first two years' worth of data:
fit <- auto.arima(ts[1:730], stepwise = FALSE, approximation = FALSE)
Series: ts[1:730]
ARIMA(4,1,1)

Coefficients:
         ar1      ar2      ar3      ar4      ma1
      0.3638  -0.2290  -0.1451  -0.2075  -0.8958
s.e.  0.0413   0.0388   0.0388   0.0398   0.0241

sigma^2 estimated as 15424930:  log likelihood=-7068.67
AIC=14149.33   AICc=14149.45   BIC=14176.88
This model doesn't seem right to me, because it does not include any seasonal terms, and I know I have enough data. Rob Hyndman's blog suggested trying ets, which also showed no seasonality. What am I not understanding about this data series or the forecasting methodology?
I've re-asked this question more appropriately on the Statistics Stack Exchange. Could someone please close this question here for me?
The question is now here:
https://stats.stackexchange.com/questions/295012/forecast-5-7-day-sales
I intend to analyse multiple data sets on the same time series (daily EOD). I will need to use computed columns: use column A + column B to create column C (storing the result of the calculation in column C). Is this functionality available using the MongoDB / Arctic database?
I also intend to search the data. For example: what happens when the advance/decline thrust pushes over 70 while the cumulative TICK was below -100,000 in the past 'n' days?
There are two data sets: cumulative TICK and the advance/decline thrust (which uses advancers/decliners data). They would be stored in the database, and then I would want the capability to search for the above condition. Is this achievable with the MongoDB / Arctic database structure?
I'm just looking for some general information before I move to a database. Currently everything I have created is in Excel / VBA, and it has already outgrown that!
Any information greatly appreciated.
Note: I will use the same database for weekly, monthly, yearly and 1-minute, 3-minute, 5-minute and 60-minute TICK/TIME based bars - not fed live, but updated EOD.
Yes, this can be done with Arctic. Arctic can store pandas DataFrames, and an operation like the one you mention is trivial in pandas. Arctic is just a store, so you'd read the data out of Arctic (data is stored under symbols), perform your transform, and then write the data back. Any of the storage engines (VersionStore, TickStore, or ChunkStore) should work for this.
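A rough sketch of that read / transform / write round trip (the library and symbol names here are placeholders, and this assumes the library has already been created with store.initialize_library):

from arctic import Arctic

# Connect to the MongoDB instance backing Arctic and open a library
store = Arctic('localhost')
library = store['eod_data']          # placeholder library name

# Read the symbol out as a pandas DataFrame
df = library.read('MY_SYMBOL').data  # placeholder symbol name

# Computed column: column C is just A + B in pandas
df['C'] = df['A'] + df['B']

# Write the enriched frame back (VersionStore keeps the previous version too)
library.write('MY_SYMBOL', df)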
Background information: We have an incident time tracker that records how long each user spends with a representative before an issue can be closed. We want to determine the average volume of incidents being handled for each hour. To put it another way: we want an hourly baseline for each day of the week showing the average total call length within each time period. For example, we want to average the total length of every call on Monday from 9AM-10AM across all the weeks in the database, and do the same for the other hourly intervals.
The simplest way to think of this is that I want AVG(SUM) for the specific time periods, but Tableau does not allow me to do this.
Tableau Output:
This is the target visualization that I am looking for from Tableau.
SQL Query:
I have written a SQL query that returns the answer. We are looking at two columns: start_time (timestamp) and interval_seconds (float).
In the inner query I use the hour_start function, which truncates the date/time value to the start of the hour, so that I can group by hour and day of the week in the outer query.
SQL Results:
Question:
Is there a way to solve this problem ENTIRELY in Tableau that would get me the result that I am looking for without having to write any SQL code?
Files Stored on Drive
CSV File:
https://drive.google.com/open?id=0B4nMLxIVTDc7NEtqWlpHdVozRXc
Tableau Worksheet:
https://drive.google.com/open?id=0B4nMLxIVTDc7M3A4Q0JxbGdlTE0
You can use Level of Detail expressions to compute the SUM(interval_seconds) at the hour level and then use AVG to calculate the number you are looking for.
I created a couple of calculations:
hour, defined as DATETRUNC('hour', [start_time])
(this should be equivalent to your hour_start(start_time)),
and interval_hours, defined as {FIXED [hour] : SUM([interval_seconds]) / 3600}.
This calculates the aggregate interval, in hours, for each start_time truncated to the hour.
After this, you simply calculate AVG(interval_hours) and use it in your view.
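If you want to double-check the numbers outside Tableau, the same two-step aggregation can be reproduced in pandas (a rough sketch; the file name is a placeholder for the CSV you shared, and the start_time / interval_seconds column names are assumed from your post):

import pandas as pd

df = pd.read_csv('incidents.csv', parse_dates=['start_time'])  # placeholder file name

# Step 1 (the LOD / inner-query step): total interval hours per clock hour
df['hour'] = df['start_time'].dt.floor('H')
hourly = df.groupby('hour', as_index=False)['interval_seconds'].sum()
hourly['interval_hours'] = hourly['interval_seconds'] / 3600

# Step 2 (the outer step): average those hourly totals by weekday and hour of day
hourly['weekday'] = hourly['hour'].dt.day_name()
hourly['hour_of_day'] = hourly['hour'].dt.hour
baseline = hourly.groupby(['weekday', 'hour_of_day'])['interval_hours'].mean()
print(baseline)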
I put a workbook in dropbox: https://www.dropbox.com/s/3hfvz8w529g9f46/Interval%20Time%20Baseline.twbx?dl=0
Although the chart looks similar to yours, the numbers I came up with are somewhat different from the "SQL Results" you show. Was the data you provided slightly different?
I am looking for some help on the best way to structure data in App Engine ndb using Python, process it, and query it later. I want to store temperature data at hourly intervals for different geographical regions.
I can think of two entity options, but there may be something much better. The first would be to store the hourly temperatures in individual properties:
class TempData(ndb.Model):
    region = ndb.StringProperty()
    date = ndb.DateProperty()
    t00 = ndb.FloatProperty()  # temperature at 00:00
    t01 = ndb.FloatProperty()  # temperature at 01:00
    ...
    t23 = ndb.FloatProperty()  # temperature at 23:00
Or I could store the data with one entity per hourly reading:
class TempData(ndb.Model):
    region = ndb.StringProperty()
    date = ndb.DateProperty()
    time = ndb.TimeProperty()
    temp = ndb.FloatProperty()
(It might be better to store the date and time together as a single DateTimeProperty?)
I want to be able to query the datastore to calculate the total, max, min, and average temperature for any given date range. With the first option I could potentially create four more properties to effectively pre-process and store the total, max, etc. for each day, so that if I wanted to query the total temperature for a year I would only have to sum 365 values as opposed to 8,760. I'm not sure how I would do this with the second option.
I am relatively new to App Engine and the Datastore, and I think I am still thinking in terms of relational databases, so any help would really be appreciated. Later on it might be necessary to store data in different time zones.
Thanks
Paul
Personally, I'd go with a variant of the first approach:
class TempData(ndb.Model):
    region = ndb.StringProperty()
    date = ndb.DateProperty()
    temp = ndb.FloatProperty(repeated=True)
using the temp list to store temperatures by hour, in order, as you learn them. I don't think per-date preprocessing will add much: to compute anything for a year you'd still need to fetch 365 entities, and the delay of that fetch will swamp the tiny amount of time required to sum up a few thousand numbers anyway.
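Computing the total, max, min and average over a date range then just means fetching the entities in that range and reducing the flattened lists in Python; a rough sketch (assuming temp holds the hourly readings as described):

def temp_stats(region, start_date, end_date):
    # Total, max, min and mean temperature for a region over a date range.
    entities = TempData.query(
        TempData.region == region,
        TempData.date >= start_date,
        TempData.date <= end_date).fetch()
    readings = [t for e in entities for t in e.temp]  # flatten the hourly lists
    if not readings:
        return None
    return {
        'total': sum(readings),
        'max': max(readings),
        'min': min(readings),
        'avg': sum(readings) / len(readings),
    }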
In general, preprocessing is useful if you want to handily query by the new fields you create through such processing (e.g. rapidly answering the question "which dates in locale X had average temperatures greater than 20 Celsius"). That does not seem to be your use case.
If anything, if it's common for you to have to compute many-month values, preprocessing to aggregate things per month (into simpler TempDataMonth entities) may be more useful. Or any other multi-day period you find useful, of course (weeks, ten-day groups, whatever). Those could be computed in a background task that periodically checks which such periods have become complete since the last check. But this is a bit beyond your question, so I'm not getting into fine-grained details.
The general idea is that minimizing the number of entities to fetch tends to be the single most important optimization; other optimizations are of course also possible, but they tend to play second fiddle to that. :-)
I have a system where people can pick stocks and it values their portfolios, but I'm having trouble doing this efficiently on a daily basis because I'm creating entries for days that don't have any changes (think of it as measuring the values with version control, so I can track changes to the way the portfolio is designed).
Here's an example (each day's portfolio, with stock name and weight):
Day 1:
    ibm = 10%
    microsoft = 50%
    google = 40%
Day 5:
    ibm = 20%
    microsoft = 20%
    google = 40%
    cisco = 20%
I can measure the value of the portfolio on day 1 and understand that I need to measure it again on day 5 (when it changed), but how do I value days 2-4 without recreating day 1's entry in the database?
My approach right now (which I don't like) is to create a temp entry in my database when someone changes the portfolio; then, at the end of the day when I calculate the values, if there is a temp entry I use that, otherwise I create a new entry (for days 2-4) using the last day's data. The issue is that, because the data often doesn't change, I'm creating entries that are basically duplicates. The catch is that my stock data is all daily. I also thought of taking a portfolio that hasn't been updated in, say, 3 days and finding the returns of the last 3 days for each stock, but I wasn't sure if there was a better solution.
Any ideas? I think this is a straightforward problem, but I just can't see an efficient way of doing it.
Note: in finance terms, it's called creating a NAV, and most firms do it the inefficient way I'm doing it, but that's because the process was created around 50 years ago and hasn't changed. I think this problem is very similar to version control, but I can't seem to come up with a solution.
In storage terms it makes most sense to just store:
UserId - StockId1 - 23% - 2012-06-25
UserId - StockId2 - 11% - 2012-06-26
UserId - StockId1 - 20% - 2012-06-30
So you can see that stock 1 went down on the 30th. Now, if you want to know the StockId1 percentage on the 28th, you just select:
SELECT *
FROM stocks
WHERE datecolumn <= DATE('2012-06-28')  -- also filter on your user / stock columns as needed
ORDER BY datecolumn DESC
LIMIT 0,1
If it returns nothing, you did not hold the stock on that date; otherwise you get the last known position back.
By the way, if you need, for example, a graph of stock 1, you can left join against a table full of dates. Then you can fill in the gaps easily.
I found this post, for example:
UPDATE mytable
SET number = (@n := COALESCE(number, @n))
ORDER BY date;
SQL QUERY replace NULL value in a row with a value from the previous known value
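If the reporting layer ends up in pandas rather than straight SQL, the same "join against a full date table and fill forward" idea takes only a few lines; a rough sketch using the StockId1 rows from above:

import pandas as pd

# Sparse history as stored: only the days StockId1's weight actually changed
changes = pd.DataFrame(
    {'stockid1_weight': [0.23, 0.20]},
    index=pd.to_datetime(['2012-06-25', '2012-06-30']))

# "Table full of dates": every calendar day in the range, then forward-fill the gaps
all_days = pd.date_range('2012-06-25', '2012-06-30', freq='D')
filled = changes.reindex(all_days).ffill()

print(filled.loc['2012-06-28'])  # 0.23, the last known position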