Do I need to use future_map or map to parallelize fable forecast? - forecasting

I've created a tsibble of ~75K time series in R Studio on my local machine.
I'm looking for ways to speed up the processing time before I migrate the process to a VM with more processing power.
Does Fable handle all of the parallel processing in the background or are there more opportunities to make the code more efficient?
Here is an example of my code
plan(multisession, gc= TRUE)
tic()
results <- train %>%
group_by_key() %>%
model(my_dcmp_spec) %>%
forecast(h="10 weeks") %>%
ungroup()
toc()
Thank you in advance!

Currently fable will model each of the series in parallel (model()) according to your plan(). The forecasts will not yet be done in parallel, but this is planned for an upcoming release: https://github.com/tidyverts/fabletools/issues/268

Related

How to use iteration in R to simplify my code for GLM?

I've just started using R and am having some issues when trying to simplify my code. I can't share my real data, but have used an open dataset to ask my question (Breed to represent my IV and Age to represent a DV).
In my dataset, I have all factor variables - my independent variable has 3 levels and my dependent variables all have 2 levels (0/1). Out of a larger dataset, I have six dependent variables and would like to run some descriptive stats and GLM for each. I have figured out working code for running each DV independently, see below. However, I am currently just copying & pasting this code and replacing the DV variables each time. I would like to instead create a function that I can apply to simplify my code.
I have attempted to do this using the purr package (map) but have had no luck. If someone could provide an example of how to do this using the sample data below, it would help me out a lot (though I know in the below data there is only one DV provided).
install.packages("GLMsData")
library(GLMsData)
data(butterfat)
library(tidyverse)
library(dplyr)
#Descriptive summaries
butterfat %>%
group_by(Breed, Age) %>%
summarise(n())
prop.table(table(butterfat$Breed, butterfat$Age), 1)
#Model
Age_model1 <- glm(Age ~ Breed, family=binomial, data=butterfat, na.action = na.omit)
#Get summary, including coefficients and p-values
summary(Age_model1)
#See coefficients, get odds ratio and confidence intervals
Age_model1$coefficients
exp(Age_model1$coefficients)
exp(confint(Age_model1))
Many Models: R for Data Science
library(tidyverse)
iris %>%
group_by(Species) %>%
nest() %>%
mutate(lm = map(data,
~glm(Sepal.Length~Sepal.Width,data = .))) %>%
mutate(tidy = map(lm,
broom::tidy)) %>%
unnest(tidy)

Quickly update django model objects from pandas dataframe

I have a Django model that records transactions. I need to update only some of the fields (two) of some of the transactions.
In order to update, the user is asked to provide additional data and I use pandas to make calculations using this extra data.
I use the output from the pandas script to update the original model like this:
for i in df.tnsx_uuid:
t = Transactions.objects.get(tnsx_uuid=i)
t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
t.save()
this is very slow. What is the best way to do this?
UPDATE:
after some more research, I found bulk_update and changed the code to:
transactions = Transactions.objects.select_for_update()\
.filter(tnsx_uuid__in=list(df.tnsx_uuid)).only('start_bal', 'end_bal')
for t in transactions:
i = t.tnsx_uuid
t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
Transactions.objects.bulk_update(transactions, ['start_bal', 'end_bal'])
this has approximately halved the time required.
How can I improve performance further?
I have been looking for the answer to this question and haven't found any authoritative, idiomatic solutions. So, here's what I've settled on for my own use:
transaction = Transactions.objects.filter(tnsx_uuid__in=list(df.tnsx_uuid))
# Build a DataFrame of Django model instances
trans_df = pd.DataFrame([{'tnsx_uuid': t.tnsx_uuid, 'object': t} for t in transactions])
# Join the Django instances to the main DataFrame on the index
df = df.join(trans_df.set_index('tnsx_uuid'))
for obj, start_bal, end_bal in zip(df['object'], df['start_bal'], df['end_bal']):
obj.start_bal = start_bal
obj.end_bal = send_bal
Transactions.objects.bulk_update(df['object'], ['start_bal', 'end_bal'])
I don't know how DataFrame.loc[] is implemented but it could be slow if it needs to search the whole DataFrame for each use rather than just do a hash lookup. For that reason and to just simply things by doing a single iteration loop, I pulled all of the model instances into df and then used the recommendation from a Stackoverflow answer on iterating over a DataFrames to loop over the zipped columns of interest.
I looked at the documentation for select_for_update in Django and it isn't apparent to me that it offers a performance improvement, but you may be using it to lock the transaction and make all of the changes atomically. Per the documentation, bulk_update should be faster than saving each object individually.
In my case, I'm only updating 3500 items. I did some timing of the various steps and came up with the following:
3.05 s to query and build the DataFrame
2.79 ms to join the instances to df
5.79 ms to run the for loop and update the instances
1.21 s to bulk_update the changes
So, I think you would need to profile your code to see what is actually taking time, but it is likely a Django issue rather than a Pandas issue.
I kind of face the same issue (almost same quantity of records 3500~), and I will like to add:
bulk_update seems to be a lot worse in performance than a
bulk_create, in my case deleting objects was allowed, so
instead of bulk_updating, I delete all objects, and then recreate them.
I used the same approach as you (thanks for the idea), but with some modifications:
a) I create the dataframe from the query itself:
all_objects_values = all_objects.values('id', 'date', 'amount')
self.df_values = pd.DataFrame.from_records(all_objects_values )
b) Then I create the column of objects without iterating (I make sure these are ordered):
self.df_values['object'] = list(all_objects)
c) For updating object values (after operations made in my dataframe), I iterate rows(not sure about performance difference):
for index, row in self.df_values.iterrows():
row['object'].amount= row['amount']
d) At the end, I re-create all objects:
MyModel.objects.bulk_create(self.df_values['object'].tolist())
Conclusion:
In my case, the most time consuming was the bulk update, so re-creating objects solved it for me (from 19 seconds with bulk_update to 10 seconds with delete + bulk_create)
In your case, using my approach may improve the time for all other operations.

Sequential scenarios in the same simulation with Gatling

It's possible multi scenarios in the same simulation?
Example:
I have to execute the same test during 2H, but in each interval time (10min), so I inject different rate user/second (per scenario).
I have tests with:
setUp(
scenario1.inject(atOnceUsers(1)).scenario2.inject(2)
).protocols(httpProtocolBuilder)
but according to results, only executes the last scenario configuration, in example case, for 2 users.
Someone could help me?
so you just want two different scenarios running at the same time?
if so, what you want is...
setUp(
scenario1.inject(
atOnceUsers(1)
).protocols(httpProtocolBuilder),
scenario2.inject(
atOnceUsers(1)
).protocols(httProtocolBuilder)
)

Server out-of-memory issue when using RJDBC in paralel computing environment

I have an R server with 16 cores and 8Gb ram that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL server, merges them on some key, then runs analyses on the dataset before writing the results to the SQL server. The connection between the workers and the SQL server runs through a RJDBC connection. When multiple workers are getting data from the SQL server, ram usage explodes and the R server crashes.
The strange thing is that the ram usage by a worker loading in data seems disproportionally large compared to the size of the loaded dataset. Each dataset has about 8000 rows and 6500 columns. This translates to about 20MB when saved as an R object on disk and about 160MB when saved as a comma-delimited file. Yet, the ram usage of the R session is about 2,3 GB.
Here is an overview of the code (some typographical changes to improve readability):
Establish connection using RJDBC:
require("RJDBC")
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>")
After this there is some code that sorts the function input vector requestedDataSets with names of all tables to query by number of records, such that we load the datasets from largest to smallest:
nrow.to.merge <- rep(0, length(requestedDataSets))
for(d in 1:length(requestedDataSets)){
nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from",requestedDataSets[d]))[1,1]
}
merge.order <- order(nrow.to.merge,decreasing = T)
We then go through the requestedDatasets vector and load and/or merge the data:
for(d in merge.order){
# force reconnect to SQL server
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
try(dbDisconnect(con), silent = T)
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<user>","<pass>")
# remove the to.merge object
rm(complete.data.to.merge)
# force garbage collection
gc()
jgc()
# ask database for dataset d
complete.data.to.merge <- dbGetQuery(con, paste0("select * from",requestedDataSets[d]))
# first dataset
if (d == merge.order[1]){
complete.data <- complete.data.to.merge
colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
}
# later dataset
else {
complete.data <- merge(
x = complete.data,
y = complete.data.to.merge,
by.x = "key_1", by.y = "key", all.x=T)
}
}
return(complete.data)
When I run this code on a serie of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely the merge call somehow blows up the usage. For the twelve iterations memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217, and 2361. Which, again, is strange as the dataset at the end is at most 162 MB...
As you can see in the code above I've already tried a couple of fixes like calling GC(), JGC() (which is a function to force a Java garbage collection jgc <- function(){.jcall("java/lang/System", method = "gc")}). I've also tried merging the data SQL-server-side, but then I run into number of columns constraints.
It vexes me that the RAM usage is so much bigger than the dataset that is eventually created, leading me to believe there is some sort of buffer/heap that is overflowing... but I seem unable to find it.
Any advice on how to resolve this issue would be greatly appreciated. Let me know if (parts of) my problem description are vague or if you require more information.
Thanks.
This answer is more of a glorified comment. Simply because the data being processed on one node only requires 160MB does not mean that the amount of memory needed to process it is 160MB. Many algorithms require O(n^2) storage space, which would be be in the GB for your chunk of data. So I actually don't see anything here which is unsurprising.
I've already tried a couple of fixes like calling GC(), JGC() (which is a function to force a Java garbage collection...
You can't force a garbage collection in Java, calling System.gc() only politely asks the JVM to do a garbage collection, but it is free to ignore the request if it wants. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt this is your bottleneck. More likely, you are simply hitting on the overhead which R needs to crunch your data.

Quickly adding edge counts to a document in ArangoDB

Not too complicated: I want to count the edges of each document and save the number in the document. I've come up with two queries that work; unfortunately since I have millions of edges both are quite slow. Is there a faster way to update documents with a property storing their number of edges? (just a count at a point in time)
AQL queries that are functional but slow:
FOR doc IN Documents
LET inEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc,{direction: 'inbound', maxDepth:1})
LET outEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc,{direction: 'outbound', maxDepth:1})
UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} In Documents
or:
FOR e IN Edges
COLLECT docId = e._to WITH COUNT INTO counter
UPDATE SPLIT(docId,'/')[1] WITH {inEdgeCount: counter}
(and then repeat for outbound edges)
As an aside, is there any way to view either query speed (e.g. FOR executions per second) or percentage completion? I've been trying to judge speed by using LIMITed queries to start with, but the time required doesn't seem to scale linearly.
With ArangoDB 2.8 you can use graph pattern matching traversals to execute this with better performance:
FOR doc IN documents
LET inEdgesCount = LENGTH(FOR v IN 1..1 INBOUND doc GRAPH 'edgeGraph' RETURN 1)
LET outEdgesCount = LENGTH(FOR v IN 1..1 OUTBOUND doc GRAPH 'edgeGraph' RETURN 1)
UPDATE doc WITH
{inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} In Documents
Currently ArangoDB doesn't have a way to monitor the progress of long running tasks. With ArangoDB 3.0 we're going to introduce a new monitoring framkework that allows better inspection of whats actually going on in the server. However, with 3.0 it won't be able to gather live statistics; we may see this further down the 3.x road later this year. Judging percentage completion may become possible for easy tasks like creating indices, but on queries its rather going to be the number of documents read/written so far.
We did similar queries for validating whether a graph obeys a power law

Resources