Using lags or monthly categorical features to capture seasonality with DeepAR and TFT from pytorch-forecasting

I'm trying to forecast monthly sales with DeepAR and the Temporal Fusion Transformer (TFT) from pytorch-forecasting. The data I use has monthly seasonality, and the seasonality is the same, or at least very similar, across countries.
When creating the TimeSeriesDataSet with pytorch-forecasting, I can set the lags parameter for the target variable. The documentation says:
Lags can be useful to indicate seasonality to the models
I'm not sure whether this is a better option than using the month, or perhaps a combination of month and country, as a categorical feature to help the models recognize the seasonality.
Does anyone have experience with this, or an explanation of which choice would be best?
Thanks in advance!

The DeepAR algorithm automatically generates features for time series. Read more here:
https://docs.aws.amazon.com/sagemaker/latest/dg/deepar_how-it-works.html
You can also add your own custom features (both categorical and continuous) for each time series, e.g. public holidays.
It works well when you have multiple time series with more than 300 data points each.
All time series should have the same frequency.
Benchmarking DeepAR against TFT is in your hands; my guess is that TFT will outperform.
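For reference, here is a minimal sketch of the two options when building the TimeSeriesDataSet (the dataframe and column names below, such as sales, country, and month, are illustrative assumptions, not from the original post):

import numpy as np
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

# Toy monthly panel for two countries (illustrative data only).
df = pd.DataFrame(
    [(c, t) for c in ["DE", "FR"] for t in range(48)],
    columns=["country", "time_idx"],
)
df["month"] = (df["time_idx"] % 12 + 1).astype(str)  # categoricals must be strings
df["sales"] = 100 + 10 * np.sin(2 * np.pi * df["time_idx"] / 12) + np.random.rand(len(df))

common = dict(
    time_idx="time_idx",
    target="sales",
    group_ids=["country"],
    max_encoder_length=24,
    max_prediction_length=6,
)

# Option 1: point the model at the seasonality via a 12-month lag of the target.
ds_lags = TimeSeriesDataSet(df, lags={"sales": [12]}, **common)

# Option 2: expose the seasonality via month (and country) as categorical features.
ds_cats = TimeSeriesDataSet(
    df,
    time_varying_known_categoricals=["month"],
    static_categoricals=["country"],
    **common,
)

In practice it is worth trying both (and their combination) on a validation split, since which one helps more tends to depend on the data.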

Related

Does Apache Superset support Weighted Averages?

I'm trying to use Apache Superset to create a dashboard that will display the average rate of X/Y at different entities such that the time grain can be changed on the fly. However, all I have available as raw data is daily totals of X and Y for the entities in question.
It would be simple to do if I could just get a line chart that displayed sum(X)/sum(Y) as its own metric, where the sum range would change with the time grain, but that doesn't seem to be supported.
Creating a function in SQLAlchemy that calculates the daily rates and then uses that as the raw data is also an insufficient solution, since taking the average of that over different time ranges would not be properly weighted.
Is there a workaround I'm not seeing?
Is there a way to use Druid or some other tool to make displaying a quotient over a variable range possible?
My current best solution is to just set up different charts for each time grain size (day, month, quarter, year), but that's extremely inelegant and I'm hoping to do better.
There are multiple ways to do this. One is to define the metric in the Metric editor; in that case the metric definition is stored as part of the chart.
Another way is to define a metric in the datasource editor, where the metric is stored with the datasource definition and becomes reusable for any chart that uses this datasource.
Side note: depending on the database you use, you may have to CAST from, say, an integer to a numeric type, or multiply by 100, in order to get a useful result.
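To illustrate why a sum(X)/sum(Y) metric gives the properly weighted rate, while averaging the daily rates does not, here is a small pandas sketch (the column names and numbers are made up):

import pandas as pd

# Made-up daily totals of X and Y for one entity.
df = pd.DataFrame({
    "x": [10, 200, 30],
    "y": [100, 1000, 50],
})

weighted = df["x"].sum() / df["y"].sum()  # what a SUM(x)/SUM(y) metric computes
naive = (df["x"] / df["y"]).mean()        # average of the daily rates

print(weighted)  # 240 / 1150 ~= 0.209
print(naive)     # (0.1 + 0.2 + 0.6) / 3 = 0.3

Because the weighted version aggregates the raw totals first, it automatically re-weights itself when the time grain (and hence the sum range) changes.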

I'd like to hear thoughts about using time-series databases for this project:

The project is to collect longitudinal data on inmates in a state prison system with the goals of recognizing time-based patterns and empowering prison justice advocates. The question is what time-series DB should I use?
My starting point is this article:
https://medium.com/schkn/4-best-time-series-databases-to-watch-in-2019-ef1e89a72377
and it's looking like the first three (InfluxDB, TimescaleDB, OpenTSDB) are on the table, but not so much the last one, since I'm dealing with much more than strictly numerical data.
Project details:
Currently I'm using Postgres and plan to update the schema to look like (in broad strokes):
low-volatility fields like: id number, name, race, gender, date-of-birth
higher-volatility fields like: current facility, release date, parole eligibility date, etc.
time-series admin data: begin current, end current, period checked. This records the time period during which the two groups of fields above are current and how often they were checked for changes.
I'm thinking it would be better to move to a time-series DB and keep track of each individual update, instead of attaching descriptive info to start-date, end-date, and period-checked fields (like: valid 2020-01-01 to 2021-08-25, checked every 14 days).
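As a rough sketch of what tracking each individual update could look like, here is an append-only layout expressed as SQLAlchemy models (all table and column names are made up for illustration; this is one possible shape, not a recommendation for a specific database):

from sqlalchemy import Column, Date, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Inmate(Base):
    # Low-volatility fields: one row per person.
    __tablename__ = "inmates"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    race = Column(String)
    gender = Column(String)
    date_of_birth = Column(Date)

class StatusSnapshot(Base):
    # Append-only: one row per check, replacing begin/end/period-checked fields.
    __tablename__ = "status_snapshots"
    id = Column(Integer, primary_key=True)
    inmate_id = Column(Integer, ForeignKey("inmates.id"), index=True)
    checked_at = Column(DateTime, index=True)  # when the record was checked
    current_facility = Column(String)
    release_date = Column(Date)
    parole_eligibility_date = Column(Date)

A layout like this works in plain Postgres, and TimescaleDB can turn the snapshot table into a hypertable partitioned on checked_at if the volume grows.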
What I want to prioritize is speed of pulling reports (like what percentage of inmates grouped by certain demographics have exited the system before serving 90% of their sentence?) over insert throughput and storage space. I'm also interested in hearing opinions on ease of learning, prominence in the industry, etc.
My background:
I'm a bootcamp-educated generalist in data science with a background in CS. I've worked with SQL (Postgres, SQLite) and NoSQL (Mongo) databases in the past, and my DB-modeling ability is from an undergrad databases class. I'm most familiar with Java and Python (and many of the data science python packages), but learning a new language isn't a huge hurdle.
Thanks for your time!

Calculating Average Weekly Traffic in Data Studio

I've got a table in Google Data Studio showing monthly traffic numbers, and I would like another table on the same page that shows average weekly traffic based on those monthly numbers.
Having some trouble figuring out the custom calculated field formula for this. Any help would be appreciated.
This seems to work for me.
SUM(Sales)/COUNT_DISTINCT(EXTRACT(ISOWEEK FROM DATE))
From your example, is it not as easy as dividing your monthly traffic numbers by 4.34 (roughly the average number of weeks in a month)?
Depending on how you want to present this, there's a pretty easy and decent solution: reference lines.
Using a reference line, you can chart weekly values (i.e. weekly sessions) on a bar chart, then plot one line for the average of that period (all the bars currently present). Because it is native to the visualization, it will recalculate as you filter!
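If it helps to sanity-check the ISOWEEK formula above outside Data Studio, here is the same calculation in pandas (the column names and data are made up):

import pandas as pd

# Made-up daily traffic data.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-02", periods=56, freq="D"),
    "sessions": 100,
})

# Equivalent of SUM(Sessions) / COUNT_DISTINCT(EXTRACT(ISOWEEK FROM Date)).
avg_weekly = df["sessions"].sum() / df["date"].dt.isocalendar().week.nunique()
print(avg_weekly)  # 5600 / 8 = 700.0

Note that partial weeks at the start or end of the range will skew the average, so it is worth checking how your date range lines up with ISO weeks.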

Which kinds of DBs calculate rate-per-minute statistics?

I have a use case where I want to design a hashtag ranking system: the 10 most popular hashtags should be selected. My idea is something like this:
[hashtag, rateofhitsperminute, rateofhitsper5minutes]
Then I will query to find the 10 most popular hashtags, i.e. those whose rateofhitsperminute is highest.
My question is: what sort of database can I use to provide statistics like 'rateofhitsperminute'?
What is a good way to calculate such a value and store it in a DB? Do some DBs offer these features?
First of all, "rate of hits per minute" is calculated:
[hits during period]/[length of period]
So the rate will vary depending on how long the period is. (The last minute? The last 10 minutes? Since the hits started being recorded? Since the hashtag was first used?)
So what you really want to store is the count of hits, not the rate. It is better to either:
Store the hashtags and their hit counts during a certain period (less memory/cpu required but less flexible)
OR the timestamp and hashtag of each hit (more memory/cpu required but more flexible)
Now it is a matter of selecting the time period of interest, and querying the database to find the top 10 hashtags with the most hits during that period.
If you need to display the rate, use the formula above, but notice it does not change the order of the top hashtags because the period is the same for every hashtag.
You can apply the algorithm above to almost any DB. You can even do it without using a database (just use a programming language's builtin hashmap).
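For example, here is the in-memory version of that approach in Python: store a (timestamp, hashtag) pair per hit, then count the hits in a chosen window and take the top 10 (data and window are made up):

from collections import Counter
from datetime import datetime, timedelta

# Option "store every hit": a list of (timestamp, hashtag) pairs.
hits = [
    (datetime(2023, 5, 1, 10, 30, 0), "java"),
    (datetime(2023, 5, 1, 10, 30, 20), "python"),
    (datetime(2023, 5, 1, 10, 30, 40), "java"),
    (datetime(2023, 5, 1, 10, 31, 5), "java"),
]

def top_hashtags(hits, start, end, k=10):
    # Count hits per hashtag within [start, end) and return the k largest counts.
    counts = Counter(tag for ts, tag in hits if start <= ts < end)
    return counts.most_common(k)

window_end = datetime(2023, 5, 1, 10, 31, 0)
window_start = window_end - timedelta(minutes=1)
print(top_hashtags(hits, window_start, window_end))
# [('java', 2), ('python', 1)]; divide by the window length if you need a rate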
If performance is a concern and there will be many different hashtags, I suggest using an OLAP database. OLAP databases are specially designed for top-k queries (over a certain time period) like this.
Having said that, here is an example of how to accomplish your use case in Solr: Solr as an Analytics Platform. Solr is not an OLAP database, but this example uses Solr like an OLAP DB and seems to be the easiest to implement and adapt to your use case:
Your Solr schema would look like:
<fields>
<field name="hashtag" type="string"/>
<field name="hit_date" type="date"/>
</fields>
An example document would be:
{
"hashtag": "java",
"hit_date": '2012-12-04T10:30:45Z'
}
A query you could use would be:
http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=hashtag&facet.mincount=1&facet.limit=10&facet.range=hit_date&facet.range.start=2012-01-01T00:00:00Z&facet.range.end=2013-01-01T00:00:00Z&facet.range.gap=%2B1MONTH
Finally, here are some advanced resources related to this question:
Similar question: Implementing twitter and facebook like hashtags
What is the best way to compute trending topics or tags? An interesting idea I got from these answers is to use the derivative of the hit counts over time to calculate the "instantaneous" hit rate.
HyperLogLog can be used to estimate the hit counts if an approximate calculation is acceptable.
Look into Sliding-Window Top-K if you want to get really academic on this topic.
No database has rate-per-minute statistics just built in, but with any modern database you could build a schema from which you can quite easily calculate rate per minute or any other derived values you need.
Your question is like asking which kind of car can drive from New York to LA: no car can drive itself or refuel itself along the way (I should be careful with this analogy, because I guess cars are almost doing this now!), but you could drive any car you like from New York to LA. Some will be more comfortable, some more fuel efficient, and some faster than others, but you're going to have to do the driving and refueling.
You can use InfluxDB. It's well suited for your use case, since it was created to handle time series data (for example "hits per minute").
In your case, every time there is a hit, you could send a record containing the name of the hashtag and a timestamp.
The data is queryable, and there are already tools that can help you process or visualize it (like Grafana).
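A minimal sketch of that record-per-hit approach, assuming InfluxDB 2.x with the influxdb-client Python package (URL, token, org, and bucket names are placeholders):

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One point per hit: measurement "hashtag_hits", tagged with the hashtag name.
write_api.write(bucket="hits", record=Point("hashtag_hits").tag("hashtag", "java").field("count", 1))

# Flux query: hit counts per hashtag over the last minute, top 10.
flux = '''
from(bucket: "hits")
  |> range(start: -1m)
  |> filter(fn: (r) => r._measurement == "hashtag_hits")
  |> group(columns: ["hashtag"])
  |> count()
  |> group()
  |> top(n: 10, columns: ["_value"])
'''
tables = client.query_api().query(flux)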
If you are happy with a large data set, you could store and calculate this information yourself.
I believe Mongo is fairly fast when it comes to index-based queries, so you could structure something like this.
Every time a tag is "hit" or accessed, you could store this information as a row:
[Tag][Timestamp]
Storing it in this fashion allows you, first of all, to run simple group, count, and sort operations, which gives you your first requirement: calculating the 10 most popular tags.
With the information in this format, you can then run further queries on tag and timestamp to count the number of hits for a specific tag between times X and Y, which gives you your hits per period.
Benefits of doing it this way:
High information granularity depending on time frames supplied via query
These queries are rather fast in MongoDB or similar databases, even on large data sets
Negatives of doing it this way:
You have to store many rows of data
You have to perform queries to retrieve the information you need rather than returning a single data row
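Here is a sketch of those queries with pymongo (collection and field names are made up for illustration):

from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
hits = client["analytics"]["tag_hits"]  # documents like {"tag": "java", "ts": <datetime>}

# An index on (tag, ts) keeps the per-tag period queries fast as the collection grows.
hits.create_index([("tag", 1), ("ts", 1)])

# Record a hit.
hits.insert_one({"tag": "java", "ts": datetime.utcnow()})

# Top 10 tags in the last 5 minutes: match on time, then group, count, sort, limit.
since = datetime.utcnow() - timedelta(minutes=5)
top10 = list(hits.aggregate([
    {"$match": {"ts": {"$gte": since}}},
    {"$group": {"_id": "$tag", "hits": {"$sum": 1}}},
    {"$sort": {"hits": -1}},
    {"$limit": 10},
]))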

Predictions with QlikView

I am here to ask for some information about QlikView functions: does QlikView have any option for prediction?
My Requirements:
I have sales data from 2013 and 2014 and I want to predict the sales for 2015. What functions can I use in QlikView to predict this specific data?
Not only sales: I have similar data for production and training for specific locations and machines, so if this works successfully for sales I can implement predictions for other departments too.
As there are a lot of techniques and methods related to prediction, I want to know which technique I should apply in QlikView, and how.
Thank you
As you said, there are a lot of techniques and methods and you would have to combine them in QlikView as there's no one function that can do it for you. I would look into time series modelling (https://en.wikipedia.org/wiki/Time_series)
There's a good 3-part video tutorial on YouTube about time series modelling (https://www.youtube.com/watch?v=gHdYEZA50KE&feature=youtu.be). Although it is done in Excel, you can apply the same techniques in QlikView.
You would probably have to use linear regression. QlikView provides some analytical functions which you can use to calculate the slope and the y-intercept of a linear regression (linest_m and linest_b).
All in all, I have found QlikView not to be very good at calculating such things. For example, if you find that polynomial regression fits your data better than linear regression, you would have to implement a lot of it yourself. Maybe it would be wise to use a statistical programming language (e.g. R or Octave) and present the results in QlikView.
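As a rough illustration of what linest_m and linest_b give you (or what you might compute in R, Octave, or Python instead), here is a minimal linear-trend forecast with made-up monthly sales:

import numpy as np

# 24 months of made-up sales for 2013-2014 with an upward trend.
months = np.arange(24)
sales = 1000 + 20 * months + np.random.normal(0, 50, size=24)

# Slope (m) and intercept (b) of the linear fit, i.e. what linest_m / linest_b return.
m, b = np.polyfit(months, sales, deg=1)

# Extrapolate the trend to the 12 months of 2015.
future_months = np.arange(24, 36)
forecast_2015 = m * future_months + b
print(forecast_2015.round(0))

A plain linear trend ignores seasonality, so for strongly seasonal sales you would typically add a seasonal component on top, which is where time series modelling resources like the ones above come in.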
