How to use multiprocessing to create a gzip file from a dataframe in Python

I have a process that's becoming IO bound where I pull a large dataset from a database into a pandas dataframe, do some line-by-line processing, and then persist the result to a gzip file. I'm trying to find a way to use multiprocessing to split the creation of the gzip file across multiple processes and then merge them into one file, or to process in parallel without overwriting a previous thread's output. I found the package p_tqdm, but I'm running into EOF issues, probably because the threads overwrite each other. Here's a sample of my current solution:
from p_tqdm import p_map
import gzip
import pandas as pd

def process(row):
    # every worker re-opens the same file in "wb" mode, truncating whatever is already there
    with gzip.open("final.gz", "wb") as f:
        value = do_something(row)
        f.write(value.encode())

df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)
p_map(process, things)

I don't know about p_tqdm, but if I understand your question correctly, it can easily be done with plain multiprocessing.
Something like this:
import multiprocessing
import gzip

def process(row):
    # note that "do_something" must return an object with an encode() method (e.g. a string)
    return do_something(row)

df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)

with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
    for processed_row in pool.imap(process, things):
        f.write(processed_row.encode())
Just a few side notes:
The pandas iterrows method is slow - avoid it if possible (see Does pandas iterrows have performance issues?).
Also, you don't need to build the things list at all; just pass an iterable to imap (even passing df.iterrows() directly should be possible) and save yourself some memory.
And finally, since it appears that you are reading SQL data, why not connect to the database directly and iterate over the cursor of the SELECT ... query, skipping pandas altogether? A rough sketch of that idea follows.
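For illustration, here is a minimal, untested sketch of that last idea. It reuses the engine from your pd.read_sql call; the SELECT statement and do_something() are placeholders:
import gzip
import multiprocessing

def process(row):
    # row arrives as a plain tuple, so it pickles cleanly across processes
    return do_something(row)

if __name__ == "__main__":
    raw_conn = engine.raw_connection()               # DB-API connection behind the SQLAlchemy engine
    try:
        cursor = raw_conn.cursor()
        cursor.execute("SELECT * FROM some_table")   # placeholder for some_sql
        rows = (tuple(r) for r in cursor)            # normalise driver-specific row types
        with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
            # only the parent process writes, so workers never clobber the file
            for processed_row in pool.imap(process, rows, chunksize=100):
                f.write(processed_row.encode())
    finally:
        raw_conn.close()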

Related

Quickly update Django model objects from a pandas dataframe

I have a Django model that records transactions. I need to update just two of the fields on some of those transactions.
In order to update them, the user is asked to provide additional data, and I use pandas to make calculations using this extra data.
I use the output from the pandas script to update the original model like this:
for i in df.tnsx_uuid:
    t = Transactions.objects.get(tnsx_uuid=i)
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
    t.save()
This is very slow. What is the best way to do this?
UPDATE:
After some more research, I found bulk_update and changed the code to:
transactions = Transactions.objects.select_for_update()\
    .filter(tnsx_uuid__in=list(df.tnsx_uuid)).only('start_bal', 'end_bal')
for t in transactions:
    i = t.tnsx_uuid
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
Transactions.objects.bulk_update(transactions, ['start_bal', 'end_bal'])
This has approximately halved the time required.
How can I improve performance further?
I have been looking for the answer to this question and haven't found any authoritative, idiomatic solutions. So, here's what I've settled on for my own use:
transactions = Transactions.objects.filter(tnsx_uuid__in=list(df.tnsx_uuid))
# Build a DataFrame of Django model instances
trans_df = pd.DataFrame([{'tnsx_uuid': t.tnsx_uuid, 'object': t} for t in transactions])
# Join the Django instances to the main DataFrame on the index
df = df.join(trans_df.set_index('tnsx_uuid'))
for obj, start_bal, end_bal in zip(df['object'], df['start_bal'], df['end_bal']):
    obj.start_bal = start_bal
    obj.end_bal = end_bal
Transactions.objects.bulk_update(df['object'], ['start_bal', 'end_bal'])
I don't know how DataFrame.loc[] is implemented, but it could be slow if it has to search the whole DataFrame on each use rather than doing a hash lookup. For that reason, and to simplify things with a single iteration loop, I pulled all of the model instances into df and then used the recommendation from a Stack Overflow answer on iterating over DataFrames to loop over the zipped columns of interest.
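As an aside, if you prefer to keep the original loop shape, a plain dict keyed by tnsx_uuid gives you that hash lookup directly. A minimal sketch of that idea (not from the original answer), assuming tnsx_uuid is unique within df:
# Build the lookup once; each subsequent lookup is O(1).
balances = {
    row.tnsx_uuid: (row.start_bal, row.end_bal)
    for row in df.itertuples(index=False)
}
transactions = list(Transactions.objects.filter(tnsx_uuid__in=balances.keys()))
for t in transactions:
    t.start_bal, t.end_bal = balances[t.tnsx_uuid]
Transactions.objects.bulk_update(transactions, ['start_bal', 'end_bal'])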
I looked at the documentation for select_for_update in Django and it isn't apparent to me that it offers a performance improvement, but you may be using it to lock the transaction and make all of the changes atomically. Per the documentation, bulk_update should be faster than saving each object individually.
In my case, I'm only updating 3500 items. I did some timing of the various steps and came up with the following:
3.05 s to query and build the DataFrame
2.79 ms to join the instances to df
5.79 ms to run the for loop and update the instances
1.21 s to bulk_update the changes
So, I think you would need to profile your code to see what is actually taking time, but it is likely a Django issue rather than a Pandas issue.
I faced roughly the same issue (almost the same quantity of records, ~3500), and I would like to add:
bulk_update seems to perform a lot worse than bulk_create. In my case deleting the objects was allowed, so instead of bulk-updating, I delete all the objects and then recreate them.
I used the same approach as you (thanks for the idea), but with some modifications:
a) I create the dataframe from the query itself:
all_objects_values = all_objects.values('id', 'date', 'amount')
self.df_values = pd.DataFrame.from_records(all_objects_values)
b) Then I create the column of objects without iterating (I make sure these are ordered):
self.df_values['object'] = list(all_objects)
c) To update the object values (after the operations made in my dataframe), I iterate over the rows (not sure about the performance difference):
for index, row in self.df_values.iterrows():
    row['object'].amount = row['amount']
d) At the end, I re-create all objects:
MyModel.objects.bulk_create(self.df_values['object'].tolist())
Conclusion:
In my case, the most time-consuming step was the bulk update, so re-creating the objects solved it for me (from 19 seconds with bulk_update down to 10 seconds with delete + bulk_create); a rough sketch of that pattern follows.
In your case, using my approach may improve the time for all the other operations as well.
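For reference, a minimal sketch of the delete-and-recreate pattern (illustrative only; ids_to_replace and new_objects are placeholders for the affected primary keys and the rebuilt, unsaved instances). Wrapping it in a transaction keeps other readers from ever seeing a half-empty table:
from django.db import transaction

with transaction.atomic():
    # remove the stale rows, then insert the rebuilt ones in bulk
    MyModel.objects.filter(id__in=ids_to_replace).delete()
    MyModel.objects.bulk_create(new_objects)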

How to Create a Relationship between two Different Columns in Neo4j

I am trying to create a relationship between two columns in Neo4j. My dataset is a CSV file whose two columns refer to co-authorship, and I want to construct a network from it. I have already loaded the data, returned it, and matched it.
Loading
load csv from 'file:///conet1.csv' as rec
return the data
create (:Guys {source: rec[0], target: rec[1]})
Now I need to construct the collaboration network by making a relationship between the source and target columns. What do you propose for this purpose?
I was able to make a relationship between the mentioned columns with the NetworkX graph library in Python like this:
import pandas as pd
import networkx as nx

# the file is a two-column CSV, so read_csv is used here
df = pd.read_csv('Colab.csv', usecols=['source', 'target'])
g = nx.from_pandas_edgelist(df, 'source', 'target')
If I understand your use case, I do not believe you should be creating Guys nodes just to store relationship info. Instead, the graph-oriented approach would be to create an Author node for each author and a relationship (say, of type COLLABORATED_WITH) between the co-authors.
This might work for you, or at least give you a clue:
LOAD CSV FROM 'file:///conet1.csv' AS rec
MERGE (source:Author {id: rec[0]})
MERGE (target:Author {id: rec[1]})
CREATE (source)-[:COLLABORATED_WITH]->(target)
If it is possible that the same relationship could be re-created, you should replace the CREATE with a more expensive MERGE. Also, a work can have any number of co-authors, so having a relationship between every pair may be sub-optimal depending on what you are trying to do; but that is a separate issue.

How to generate a training dataset from huge binary data using TF 2.0?

I have a binary dataset of over 15 GB. I want to extract the data for model training using TF 2.0. Currently, here is what I am doing:
import numpy as np
import tensorflow as tf

data1 = np.fromfile('binary_file1', dtype='uint8')
data2 = np.fromfile('binary_file2', dtype='uint8')
dataset = tf.data.Dataset.from_tensor_slices((data1, data2))
# then do something like batch, shuffle, prefetch, etc.
for sample in dataset:
    pass
but this consumes all my memory, and I don't think it's a good way to deal with such big files. What should I do?
You have to make use of FixedLengthRecordDataset to read binary files. Although this dataset accepts several files together, it extracts data from only one file at a time. Since you wish to read data from two binary files simultaneously, you will have to create two FixedLengthRecordDatasets and then zip them.
Another thing to note is that you have to specify the record_bytes parameter, which states how many bytes you would like to read per iteration; the total size of each binary file has to be an exact multiple of this value.
ds1 = tf.data.FixedLengthRecordDataset('file1.bin', record_bytes=1000)
ds2 = tf.data.FixedLengthRecordDataset('file2.bin', record_bytes=1000)
ds = tf.data.Dataset.zip((ds1, ds2))
ds = ds.batch(32)
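As a possible continuation (a sketch of my own, not part of the original answer), you can decode each fixed-length record back into tensors and add shuffling and prefetching; record_bytes=1000 is just the example value from above:
def decode(rec1, rec2):
    # each record arrives as a scalar string of record_bytes raw bytes
    x = tf.io.decode_raw(rec1, tf.uint8)
    y = tf.io.decode_raw(rec2, tf.uint8)
    return x, y

ds = tf.data.Dataset.zip((ds1, ds2))
ds = ds.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.shuffle(10000).batch(32).prefetch(tf.data.experimental.AUTOTUNE)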

Server out-of-memory issue when using RJDBC in a parallel computing environment

I have an R server with 16 cores and 8 GB of RAM that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL server, merges them on some key, then runs analyses on the dataset before writing the results to the SQL server. The connection between the workers and the SQL server runs through an RJDBC connection. When multiple workers are getting data from the SQL server, RAM usage explodes and the R server crashes.
The strange thing is that the RAM usage by a worker loading in data seems disproportionately large compared to the size of the loaded dataset. Each dataset has about 8000 rows and 6500 columns. This translates to about 20 MB when saved as an R object on disk and about 160 MB when saved as a comma-delimited file. Yet, the RAM usage of the R session is about 2.3 GB.
Here is an overview of the code (some typographical changes to improve readability):
Establish connection using RJDBC:
require("RJDBC")
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>")
After this there is some code that sorts the function input vector requestedDataSets (which contains the names of all tables to query) by number of records, so that we load the datasets from largest to smallest:
nrow.to.merge <- rep(0, length(requestedDataSets))
for(d in 1:length(requestedDataSets)){
  nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from ", requestedDataSets[d]))[1,1]
}
merge.order <- order(nrow.to.merge, decreasing = T)
We then go through the requestedDatasets vector and load and/or merge the data:
for(d in merge.order){
  # force reconnect to SQL server
  drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", "sqljdbc4.jar")
  try(dbDisconnect(con), silent = T)
  con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>", "<user>", "<pass>")
  # remove the to.merge object
  rm(complete.data.to.merge)
  # force garbage collection
  gc()
  jgc()
  # ask database for dataset d
  complete.data.to.merge <- dbGetQuery(con, paste0("select * from ", requestedDataSets[d]))
  # first dataset
  if (d == merge.order[1]){
    complete.data <- complete.data.to.merge
    colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
  }
  # later dataset
  else {
    complete.data <- merge(
      x = complete.data,
      y = complete.data.to.merge,
      by.x = "key_1", by.y = "key", all.x = T)
  }
}
return(complete.data)
When I run this code on a series of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely that the merge call somehow blows up the memory usage. For the twelve iterations memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217, and 2361. Which, again, is strange, as the dataset at the end is at most 162 MB...
As you can see in the code above, I've already tried a couple of fixes, like calling gc() and jgc() (a function to force a Java garbage collection: jgc <- function(){.jcall("java/lang/System", method = "gc")}). I've also tried merging the data SQL-server-side, but then I run into column-count constraints.
It vexes me that the RAM usage is so much bigger than the dataset that is eventually created, leading me to believe there is some sort of buffer/heap that is overflowing... but I seem unable to find it.
Any advice on how to resolve this issue would be greatly appreciated. Let me know if (parts of) my problem description are vague or if you require more information.
Thanks.
This answer is more of a glorified comment. Simply because the data being processed on one node only requires 160 MB does not mean that the amount of memory needed to process it is 160 MB. Many algorithms require O(n^2) storage space, which would be in the gigabytes for your chunk of data. So I actually don't see anything surprising here.
I've already tried a couple of fixes like calling gc(), jgc() (which is a function to force a Java garbage collection...
You can't force a garbage collection in Java; calling System.gc() only politely asks the JVM to do a garbage collection, but it is free to ignore the request if it wants. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt this is your bottleneck. More likely, you are simply hitting the overhead that R needs to crunch your data.

Storing R Objects in a relational database

I frequently create nonparametric statistics (loess, kernel densities, etc) on data I pull out of a relational database. To make data management easier I would like to store R output back inside my DB. This is easy with simple data frames of numbers or text, but I have not figured out how to store R objects back in my relational database. So is there a way to store a vector of kernel densities, for example, back into a relational database?
Right now I work around this by saving the R objects to a network drive space so others can load the objects as needed.
Use the serialization feature to turn any R object into a (raw or character) string, then store that string. See help(serialize).
Reverse this for retrieval: get the string, then unserialize() it into an R object.
An example R variable that's fairly complex:
library(nlme)
model <- lme(uptake ~ conc + Treatment, CO2, random = ~ 1 | Plant / Type)
The best storage database method for R variables depends upon how you want to use it.
I need to do in-database analytics on the values
In this case, you need to break the object down into values that the database can handle natively. This usually means converting it into one or more data frames. The easiest way to do this is to use the broom package.
library(broom)
coefficients_etc <- tidy(model)
model_level_stats <- glance(model)
row_level_stats <- augment(model)
I just want storage
In this case you want to serialize your R variables. That is, convert them to a string or a binary blob. There are several methods for this.
My data has to be accessible by programs other than R, and needs to be human-readable
You should store your data in a cross-platform text format; probably JSON or YAML. JSON doesn't support some important concepts like Inf; YAML is more general but the support in R isn't as mature. XML is also possible, but is too verbose to be useful for storing large arrays.
library(RJSONIO)
model_as_json <- toJSON(model)
nchar(model_as_json) # 17916
library(yaml)
# yaml package doesn't yet support conversion of language objects,
# so preprocessing is needed
model2 <- within(
  model,
  {
    call <- as.character(call)
    terms <- as.character(terms)
  }
)
model_as_yaml <- as.yaml(model2)
nchar(model_as_yaml) # 14493
My data has to be accessible by programs other than R, and doesn't need to be human-readable
You could write your data to an open, cross-platform binary format like HDF5. Currently support for HDF5 files (via rhdf5) is limited, so complex objects are not supported. (You'll probably need to unclass everything.)
library(rhdf5)
h5save(rapply(model2, unclass, how = "replace"), file = "model.h5")
bin_h5 <- readBin("model.h5", "raw", 1e6)
length(bin_h5) # 88291 not very efficient in this case
The feather package lets you save data frames in a format readable by both R and Python. To use this, you would first have to convert the model object into data frames, as described in the broom section earlier in the answer.
library(feather)
library(broom)
write_feather(augment(model), "co2_row.feather") # 5474 bytes
write_feather(tidy(model), "co2_coeff.feather") # 2093 bytes
write_feather(glance(model), "co2_model.feather") # 562 bytes
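On the Python side, reading those files back is straightforward. A minimal sketch (my own addition; it assumes pandas with a feather-capable backend such as pyarrow installed):
import pandas as pd

row_stats = pd.read_feather("co2_row.feather")
coefficients = pd.read_feather("co2_coeff.feather")
model_stats = pd.read_feather("co2_model.feather")
print(coefficients.head())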
Another alternative is to save a text version of the variable (see previous section) to a zipped file and store its bytes in the database.
writeLines(model_as_json, "model.txt")
tar("model.tar.bz", "model.txt", compression = "bzip2")
bin_bzip <- readBin("model.tar.bz", "raw", 1e6)
length(bin_bzip) # only 42 bytes!
My data only needs to be accessible by R, and needs to be human-readable
There are two options for turning a variable into a string: serialize and deparse.
p <- function(x)
{
  paste0(x, collapse = "\n")
}
serialize needs to be sent to a text connection, and rather than writing to file, you can write to the console and capture it.
model_serialized <- p(capture.output(serialize(model, stdout())))
nchar(model_serialized) # 23830
Use deparse with control = "all" to maximise the reversibility when re-parsing later.
model_deparsed <- p(deparse(model, control = "all"))
nchar(model_deparsed) # 22036
My data only needs to be accessible by R, and doesn't need to be human-readable
The same sorts of techniques shown in the previous sections can be applied here. You can zip a serialized or deparsed variable and re-read it as a raw vector.
serialize can also write variables in a binary format. In this case, it is most easily used with its wrapper saveRDS.
saveRDS(model, "model.rds")
bin_rds <- readBin("model.rds", "raw", 1e6)
length(bin_rds) # 6350
For SQLite (and possibly others):
CREATE TABLE data (blob BLOB);
Now in R:
RSQLite::dbGetQuery(db.conn, 'INSERT INTO data VALUES (:blob)', params = list(blob = list(serialize(some_object, NULL))))
Note the list wrapper around the serialized object. The output of serialize is a raw vector; without list, the INSERT statement would be executed once for each element of that vector. Wrapping it in a list allows RSQLite::dbGetQuery to treat it as a single value.
To get the object back from the database:
some_object <- unserialize(RSQLite::dbGetQuery(db.conn, 'SELECT blob FROM data LIMIT 1')$blob[[1]])
What happens here is that you take the field blob (which is a list, since RSQLite doesn't know how many rows the query will return). Since LIMIT 1 ensures only one row is returned, we take it with [[1]], which gives the original raw vector. Then you unserialize the raw vector to get your object back.
Using textConnection / saveRDS / readRDS is perhaps the most versatile and high-level approach:
zz<-textConnection('tempConnection', 'wb')
saveRDS(myData, zz, ascii = T)
TEXT<-paste(textConnectionValue(zz), collapse='\n')
#write TEXT into SQL
...
closeAllConnections() #if the connection persists, new data will be appended
#reading back:
#1. pull from SQL into queryResult
...
#2. recover the object
recoveredData <- readRDS(textConnection(queryResult$TEXT))
[100% WORKING - 27 Feb 2020]
Description:
Here are the steps if you want to store your model in a PostgreSQL table, then query it and load it back. An important part is ascii = TRUE; without it, you would get errors when serializing and storing the model.
db <- pgsql_connect #connection to your database
serialized_model <- rawToChar(serialize(model_fit, NULL, ascii=TRUE))
insert_query <-'INSERT INTO table (model) VALUES ($1)'
rs <- dbSendQuery(db, insert_query, list(serialized_model))
dbClearResult(rs)
serialized_model <- dbGetQuery(db, "select model from table order by created_at desc limit 1")
model_fit2 <- unserialize(charToRaw(as.character(serialized_model[,c('model')])))
model_fit2
