Server out-of-memory issue when using RJDBC in paralel computing environment - sql-server

I have an R server with 16 cores and 8Gb ram that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL server, merges them on some key, then runs analyses on the dataset before writing the results to the SQL server. The connection between the workers and the SQL server runs through a RJDBC connection. When multiple workers are getting data from the SQL server, ram usage explodes and the R server crashes.
The strange thing is that the ram usage by a worker loading in data seems disproportionally large compared to the size of the loaded dataset. Each dataset has about 8000 rows and 6500 columns. This translates to about 20MB when saved as an R object on disk and about 160MB when saved as a comma-delimited file. Yet, the ram usage of the R session is about 2,3 GB.
Here is an overview of the code (some typographical changes to improve readability):
Establish connection using RJDBC:
require("RJDBC")
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>")
After this there is some code that sorts the function input vector requestedDataSets with names of all tables to query by number of records, such that we load the datasets from largest to smallest:
nrow.to.merge <- rep(0, length(requestedDataSets))
for(d in 1:length(requestedDataSets)){
nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from",requestedDataSets[d]))[1,1]
}
merge.order <- order(nrow.to.merge,decreasing = T)
We then go through the requestedDatasets vector and load and/or merge the data:
for(d in merge.order){
# force reconnect to SQL server
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
try(dbDisconnect(con), silent = T)
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<user>","<pass>")
# remove the to.merge object
rm(complete.data.to.merge)
# force garbage collection
gc()
jgc()
# ask database for dataset d
complete.data.to.merge <- dbGetQuery(con, paste0("select * from",requestedDataSets[d]))
# first dataset
if (d == merge.order[1]){
complete.data <- complete.data.to.merge
colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
}
# later dataset
else {
complete.data <- merge(
x = complete.data,
y = complete.data.to.merge,
by.x = "key_1", by.y = "key", all.x=T)
}
}
return(complete.data)
When I run this code on a serie of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely the merge call somehow blows up the usage. For the twelve iterations memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217, and 2361. Which, again, is strange as the dataset at the end is at most 162 MB...
As you can see in the code above I've already tried a couple of fixes like calling GC(), JGC() (which is a function to force a Java garbage collection jgc <- function(){.jcall("java/lang/System", method = "gc")}). I've also tried merging the data SQL-server-side, but then I run into number of columns constraints.
It vexes me that the RAM usage is so much bigger than the dataset that is eventually created, leading me to believe there is some sort of buffer/heap that is overflowing... but I seem unable to find it.
Any advice on how to resolve this issue would be greatly appreciated. Let me know if (parts of) my problem description are vague or if you require more information.
Thanks.

This answer is more of a glorified comment. Simply because the data being processed on one node only requires 160MB does not mean that the amount of memory needed to process it is 160MB. Many algorithms require O(n^2) storage space, which would be be in the GB for your chunk of data. So I actually don't see anything here which is unsurprising.
I've already tried a couple of fixes like calling GC(), JGC() (which is a function to force a Java garbage collection...
You can't force a garbage collection in Java, calling System.gc() only politely asks the JVM to do a garbage collection, but it is free to ignore the request if it wants. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt this is your bottleneck. More likely, you are simply hitting on the overhead which R needs to crunch your data.

Related

How to select drillbits to run a query using offset?

For example I have two drillbits running on different machines and a table with 200 rows. Is it possible to manually choose drillbit1 to fetch first 100 rows and drillbit2 fetch next 100 rows using offset query and get a merged result(total 200 rows)?
Because in my case I'm having a parquet file of size roughly 500kb but I'm not able to get result of the query select * from dfs.'/path/to/parquet/file'; without limit through web ui as it returns with error:
RESOURCE ERROR: There is not enough heap memory to run this query using the web interface.
Please try a query with fewer columns or with a filter or limit condition to limit the data returned.
You can also try an ODBC/JDBC client.
Following is the configuration of both the drillbits:
Size of RAM on machine = 8G DRILLBIT_MAX_PROC_MEM = "6G"
DRILL_HEAP = "2G" DRILL_MAX_DIRECT_MEMORY = "3G"Apache Drill version: 1.14.0
I've ran the following queries as suggested on other sites for avoiding heap memory space error:
alter session set planner.width.max_per_node = 1
alter system set planner.width.max_per_query = 2
But I still face the heap space error. Any help would be appreciated.
You don't need to manage drillbits execution and merge results. Drill does it for you internally. And data in Drill doesn't seat on Heap, it uses Direct memory. Heap is used mainly for Drill planning and execution process.
Looks like you have an issue, because your Drill is very limited in memory. The recommended Heap size for Drill is 4-8G, please see details here: https://drill.apache.org/docs/configuring-drill-memory/
Currently all Drill unit tests can't pass with 8G memory machines, which used as CI (TravisCI and CircleCI): https://github.com/apache/drill/blob/master/.circleci/config.yml#L52

NDB Queries Exceeding GAE Soft Private Memory Limit

I currently have a an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graph of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, such as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast json string for graphing, only a few other fields stored in the entity. The other approach I tried to run was to only fetch a couple of entities out at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
for acct in qry.fetch():
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
# Children Entity - log of a weather at parent location
class WeatherData(ndb.Model):
# model for data to save
...
# Function for querying data below a given ancestor between two optional
# times
#classmethod
def time_ordered_query(cls, ancestor_key, start=None, end=None):
return cls.query(cls.time>=start, cls.time<=end,ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
cursor = None
while True:
gc.collect()
fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
if fetched:
for acct in fetched:
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
if more and next_cursor:
cursor = next_cursor
else:
break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be with Python's garbage collector not deleting the already used information that is re-referenced, but even when I include gc.collect() I see no improvement there.
EDIT:
Following the advice below, I fixed the problem using Projection Queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: namely querying all properties of the entity excluding the JSON string. While this is not ideal as it still pulls gratuitous information from the database each time, generating individual queries of each specific query is not scalable due to the exponential growth of necessary indices. For this application, as each additional property is negligible additional memory (aside form that json string), it works!
You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time) and cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally you could try to create and store just portions of the graphs and just stitch them together in the end (only if it comes down to that, I'm not sure how exactly it would be done, I imagine it wouldn't be trivial).

Inserting Large Object into Postgresql returns 53200 Out of Memory error

Postgresql 9.1
NPGSQL 2.0.12
I have binary data I am wanting to store in a postgresql database. Most files load fine, however, a large binary (664 Mb) file is causing problems. When trying to load the file to postgresql using Large Object support through Npgsql, the postgresql server returns 'out of memory' error.
I'm running this at present on a workstation with 4Gb RAM, with 2Gb free with postgresql running in an idle state.
This is the code I am using, adapted from PG Foundry Npgsql User's Manual.
using (var transaction = connection.BeginTransaction())
{
try
{
var manager = new NpgsqlTypes.LargeObjectManager(connection);
var noid = manager.Create(NpgsqlTypes.LargeObjectManager.READWRITE);
var lo = manager.Open(noid, NpgsqlTypes.LargeObjectManager.READWRITE);
lo.Write(BinaryData);
lo.Close();
transaction.Commit();
return noid;
}
catch
{
transaction.Rollback();
throw;
}
}
I've tried modifying postgresql's memory settings from defaults to all manner of values adjusting:
shared_buffers
work_mem
maintenance_work_mem
So far I've found postgresql to be a great database system, but this is a show stopper at present and I can't seem to get this sized file into the database. I don't really want to have to deal with manually chopping the file into chunks and recreating client side if I can help it.
Please help!?
I think the answer appears to be calling the Write() method of the LargeObject class iteratively with chunks of the byte array. I know I said I didn't want to have to deal with chunking the data, but what I really meant was chunking the data into separate LargeObjects. This solution means I chunk the array, but it is still stored in the database as one object, meaning I don't have to keep track of file parts, just the one oid.
do
{
var length = 1000;
if (i + length > BinaryData.Length) length = BinaryData.Length - i;
byte[] chunk = new byte[length];
Array.Copy(BinaryData, i, chunk, 0, length);
lo.Write(chunk, 0, length);
i += length;
} (i < BinaryData.Length)
Try decreasing the number of max_connections to reserve memory for the few connections that need to do one 700MB operation. Increase the work_mem (which is memory available per-operation) to 1GB. Trying to cram 700MB into one field sounds odd.
Increase the shared_buffers size to 4096MB.

Storing R Objects in a relational database

I frequently create nonparametric statistics (loess, kernel densities, etc) on data I pull out of a relational database. To make data management easier I would like to store R output back inside my DB. This is easy with simple data frames of numbers or text, but I have not figured out how to store R objects back in my relational database. So is there a way to store a vector of kernel densities, for example, back into a relational database?
Right now I work around this by saving the R objects to a network drive space so others can load the objects as needed.
Use the serialization feature to turn any R object into a (raw or character) string, then store that string. See help(serialize).
Reverse this for retrieval: get the string, then unserialize() into a R object.
An example R variable, that's fairly complex:
library(nlme)
model <- lme(uptake ~ conc + Treatment, CO2, random = ~ 1 | Plant / Type)
The best storage database method for R variables depends upon how you want to use it.
I need to do in-database analytics on the values
In this case, you need to break the object down into values that the database can handle natively. This usually means converting it into one or more data frames. The easiest way to do this is to use the broom package.
library(broom)
coefficients_etc <- tidy(model)
model_level_stats <- glance(model)
row_level_stats <- augment(model)
I just want storage
In this case you want to serialize your R variables. That is, converting them to be a string or a binary blob. There are several methods for this.
My data has to be accessible by programs other than R, and needs to be human-readable
You should store your data in a cross-platform text format; probably JSON or YAML. JSON doesn't support some important concepts like Inf; YAML is more general but the support in R isn't as mature. XML is also possible, but is too verbose to be useful for storing large arrays.
library(RJSONIO)
model_as_json <- toJSON(model)
nchar(model_as_json) # 17916
library(yaml)
# yaml package doesn't yet support conversion of language objects,
# so preprocessing is needed
model2 <- within(
model,
{
call <- as.character(call)
terms <- as.character(terms)
}
)
model_as_yaml <- as.yaml(model2)
nchar(model_as_yaml) # 14493
My data has to be accessible by programs other than R, and doesn't need to be human-readable
You could write your data to an open, cross-platform binary format like HFD5. Currently support for HFD5 files (via rhdf5) is limited, so complex objects are not supported. (You'll probably need to unclass everything.)
library(rhdf5)
h5save(rapply(model2, unclass, how = "replace"), file = "model.h5")
bin_h5 <- readBin("model.h5", "raw", 1e6)
length(bin_h5) # 88291 not very efficient in this case
The feather package let's you save data frames in a format readable by both R and Python. To use this, you would first have to convert the model object into data frames, as described in the broom section earlier in the answer.
library(feather)
library(broom)
write_feather(augment(model), "co2_row.feather") # 5474 bytes
write_feather(tidy(model), "co2_coeff.feather") # 2093 bytes
write_feather(glance(model), "co2_model.feather") # 562 bytes
Another alternative is to save a text version of the variable (see previous section) to a zipped file and store its bytes in the database.
writeLines(model_as_json)
tar("model.tar.bz", "model.txt", compression = "bzip2")
bin_bzip <- readBin("model.tar.bz", "raw", 1e6)
length(bin_bzip) # only 42 bytes!
My data only needs to be accessible by R, and needs to be human-readable
There are two options for turning a variable into a string: serialize and deparse.
p <- function(x)
{
paste0(x, collapse = "\n")
}
serialize needs to be sent to a text connection, and rather than writing to file, you can write to the console and capture it.
model_serialized <- p(capture.output(serialize(model, stdout())))
nchar(model_serialized) # 23830
Use deparse with control = "all" to maximise the reversibility when re-parsing later.
model_deparsed <- p(deparse(model, control = "all"))
nchar(model_deparsed) # 22036
My data only needs to be accessible by R, and doesn't need to be human-readable
The same sorts of techniques shown in the previous sections can be applied here. You can zip a serialized or deparsed variable and re-read it as a raw vector.
serialize can also write variables in a binary format. In this case, it is most easily used with its wrapper saveRDS.
saveRDS(model, "model.rds")
bin_rds <- readBin("model.rds", "raw", 1e6)
length(bin_rds) # 6350
For sqlite (and possibly others):
CREATE TABLE data (blob BLOB);
Now in R:
RSQLite::dbGetQuery(db.conn, 'INSERT INTO data VALUES (:blob)', params = list(blob = list(serialize(some_object)))
Note the list wrapper around some_object. The output of serialize is a raw vector. Without list, the INSERT statement would be executed for each vector element. Wrapping it in a list allows RSQLite::dbGetQuery to see it as one element.
To get the object back from the database:
some_object <- unserialize(RSQLite::dbGetQuery(db.conn, 'SELECT blob FROM data LIMIT 1')$blob[[1]])
What happens here is you take the field blob (which is a list since RSQLite doesn't know how many rows will be returned by the query). Since LIMIT 1 assures only 1 row is returned, we take it with [[1]], which is the original raw vector. Then you need to unserialize the raw vector to get your object.
Using textConnection / saveRDS / loadRDS is perhaps the most versatile and high level:
zz<-textConnection('tempConnection', 'wb')
saveRDS(myData, zz, ascii = T)
TEXT<-paste(textConnectionValue(zz), collapse='\n')
#write TEXT into SQL
...
closeAllConnections() #if the connection persists, new data will be appended
#reading back:
#1. pull from SQL into queryResult
...
#2. recover the object
recoveredData <- readRDS(textConnection(queryResult$TEXT))
[100% WORKING - 27 Feb 2020]
Description:
Here are the steps if you want to store your model into a POSTGRES table, then query it and load it. An important part is the ascii = TRUE, which otherwise would produce errors when serializing
db <- pgsql_connect #connection to your database
serialized_model <- rawToChar(serialize(model_fit, NULL, ascii=TRUE))
insert_query <-'INSERT INTO table (model) VALUES ($1)'
rs <- dbSendQuery(db, insert_query, list(serialized_model))
dbClearResult(rs)
serialized_model <- dbGetQuery(db, "select model from table order by created_at desc limit 1")
model_fit2 <- unserialize(charToRaw(as.character(serialized_model[,c('model')])))
model_fit2

key-value store for time series data?

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more then a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using somethings a key-value store like MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mentions of using such a database for time series data. ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one date point per day (first, last, closed to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
it seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance...anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single database.)
How to do your three queries:
retrieve all the data for object XYZ
between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one date
point per day (first, last, closed to
time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
retrieve all data for all objects for
a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However if your application requires the speed it might be worth looking it.
There is an open source timeseries database under active development (.NET only for now) that I wrote. It can store massive amounts (terrabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for the stock ticks storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
file.UniqueIndexes = true; // enforces index uniqueness
file.InitializeNewFile(); // create file and write header
file.AppendData(data); // append data (stream of ArraySegment<>)
}
// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStrut>) BinaryFile.Open("data.bts", false))
{
// Enumerate one item at a time maxitum 10 items starting at 2011-1-1
// (can also get one segment at a time with StreamSegments)
foreach (var val in file.Stream(new UtcDateTime(2011,1,1), maxItemCount = 10)
Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1 minute bar format for the symbol in question in a Space delimited file which has roughly 80,000 1 minute bar readings. The code to load and parse from disk was under 1ms. The code to calculate a 100 minute SMA for every period in the file was 530ms. I can pull any slice I want from the SMA sequence once calculated in under 1ms. I am just learning F# so there are probably ways to optimize. Note this was after multiple test runs so it was already in the windows Cache but even when loaded from disk it never adds more than 15ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with \n delimiter and it generally takes less than 0.5ms to load and parse when in the windows file cache. Simple iteration across the full time series data to return the set of records inside a date range in a sub 3ms operation with a full year of 1 minute bars. I also keep the daily bars in a separate file which loads even faster because of the lower data volumes.
I use the .net4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series and with a couple gig's of RAM dedicated to cache I get nearly a 100% cache hit rate so my access to any pre-computed indicator set for any symbol generally runs under 1ms.
Pulling any slice of data I want from the indicator is typically less than 1ms so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1 minute bar in less than 20ms.
// Parse a \n delimited file into RAM then
// then split each line on space to into a
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
System.IO.File.ReadAllLines(fname)
|> Array.map (fun line -> line.Split [|' '|])
// Based on a two dimensional array
// pull out a single column for bar
// close and convert every value
// for every row to a float
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
[| for aLine in tarr do
//printfn "aLine=%A" aLine
let closep = float(aLine.[5])
yield closep
|]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.

Resources