Iterate over big resultset in batches (like foreach, but grouped) - scalikejdbc

I am using ScalikeJDBC to fetch a large table, convert the rows to JSON, and then call a web service with 50 JSON objects (rows) at a time. This is my code:
val rows = sql"SELECT * FROM bigtable"
val jsons = rows.map { row =>
  // build JSON object for each row
}.toList().apply()

jsons.grouped(50).foreach { batch =>
  // send 50 objects at once to an HTTP server
}
This works, but unfortunately the intermediate list is huge and consumes a lot of memory. I am looking for a way to iterate over the result set in a "lazy" fashion, similar to foreach, except that I want to iterate over batches of 50 rows. Is that possible with ScalikeJDBC?
I solved the memory issues by filling and clearing a mutable list instead of using grouped, but I am still looking for a better solution.

Try specifying fetchSize.
See also: http://scalikejdbc.org/documentation/operations.html#setting-jdbc-fetchsize
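For example, here is a minimal sketch that combines fetchSize with the streaming foreach and a small buffer flushed every 50 rows. toJson and postBatch are placeholders for your own JSON conversion and HTTP code, the fetch size of 1000 is arbitrary, and depending on your JDBC driver you may need to run inside a single session (e.g. DB.readOnly { implicit session => ... }) for rows to actually be streamed:
import scalikejdbc._
import scala.collection.mutable.ListBuffer

implicit val session = AutoSession

val buffer = ListBuffer.empty[String]          // JSON strings produced by toJson (placeholder)

sql"SELECT * FROM bigtable".fetchSize(1000).foreach { row =>
  buffer += toJson(row)                        // placeholder: build the JSON for one row
  if (buffer.size == 50) {
    postBatch(buffer.toList)                   // placeholder: send 50 objects at once
    buffer.clear()
  }
}
if (buffer.nonEmpty) postBatch(buffer.toList)  // send the final partial batch
This keeps only the current batch in memory instead of materializing the whole table as a list.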

Related

Quickly update django model objects from pandas dataframe

I have a Django model that records transactions. I need to update only some of the fields (two) of some of the transactions.
In order to update, the user is asked to provide additional data and I use pandas to make calculations using this extra data.
I use the output from the pandas script to update the original model like this:
for i in df.tnsx_uuid:
    t = Transactions.objects.get(tnsx_uuid=i)
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
    t.save()
This is very slow. What is the best way to do this?
UPDATE:
after some more research, I found bulk_update and changed the code to:
transactions = Transactions.objects.select_for_update()\
    .filter(tnsx_uuid__in=list(df.tnsx_uuid)).only('start_bal', 'end_bal')
for t in transactions:
    i = t.tnsx_uuid
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
Transactions.objects.bulk_update(transactions, ['start_bal', 'end_bal'])
This has approximately halved the time required.
How can I improve performance further?
I have been looking for the answer to this question and haven't found any authoritative, idiomatic solutions. So, here's what I've settled on for my own use:
transactions = Transactions.objects.filter(tnsx_uuid__in=list(df.tnsx_uuid))
# Build a DataFrame of Django model instances
trans_df = pd.DataFrame([{'tnsx_uuid': t.tnsx_uuid, 'object': t} for t in transactions])
# Join the Django instances to the main DataFrame on the index
df = df.join(trans_df.set_index('tnsx_uuid'))
for obj, start_bal, end_bal in zip(df['object'], df['start_bal'], df['end_bal']):
    obj.start_bal = start_bal
    obj.end_bal = end_bal
Transactions.objects.bulk_update(df['object'], ['start_bal', 'end_bal'])
I don't know how DataFrame.loc[] is implemented, but it could be slow if it needs to search the whole DataFrame on each use rather than do a hash lookup. For that reason, and to simplify things with a single iteration loop, I pulled all of the model instances into df and then used the recommendation from a Stack Overflow answer on iterating over DataFrames to loop over the zipped columns of interest.
I looked at the documentation for select_for_update in Django and it isn't apparent to me that it offers a performance improvement, but you may be using it to lock the transaction and make all of the changes atomically. Per the documentation, bulk_update should be faster than saving each object individually.
In my case, I'm only updating 3500 items. I did some timing of the various steps and came up with the following:
3.05 s to query and build the DataFrame
2.79 ms to join the instances to df
5.79 ms to run the for loop and update the instances
1.21 s to bulk_update the changes
So, I think you would need to profile your code to see what is actually taking time, but it is likely a Django issue rather than a Pandas issue.
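One further knob that might be worth profiling (not mentioned above, just a documented option): bulk_update accepts a batch_size argument that splits the single large UPDATE into smaller statements, which can help when updating many rows at once. A sketch, with an arbitrary batch size:
# Same call as above, split into smaller UPDATE statements; 500 is an arbitrary choice.
Transactions.objects.bulk_update(df['object'], ['start_bal', 'end_bal'], batch_size=500)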
I faced much the same issue (almost the same quantity of records, ~3500), and I would like to add:
bulk_update seems to perform a lot worse than bulk_create. In my case deleting objects was allowed, so instead of bulk-updating, I delete all the objects and then recreate them.
I used the same approach as you (thanks for the idea), but with some modifications:
a) I create the dataframe from the query itself:
all_objects_values = all_objects.values('id', 'date', 'amount')
self.df_values = pd.DataFrame.from_records(all_objects_values)
b) Then I create the column of objects without iterating (I make sure these are ordered):
self.df_values['object'] = list(all_objects)
c) To update the object values (after the operations made in my dataframe), I iterate over the rows (not sure about the performance difference):
for index, row in self.df_values.iterrows():
    row['object'].amount = row['amount']
d) At the end, I re-create all objects:
MyModel.objects.bulk_create(self.df_values['object'].tolist())
Conclusion:
In my case, the most time-consuming step was the bulk update, so re-creating the objects solved it for me (from 19 seconds with bulk_update to 10 seconds with delete + bulk_create).
In your case, using my approach may improve the time for all other operations.
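For reference, here is a hypothetical end-to-end sketch of the delete-then-recreate flow described in steps a) to d); the id column filter and the transaction.atomic() wrapper are my assumptions, not part of the original code:
from django.db import transaction

with transaction.atomic():
    # remove the stale rows, then re-insert the updated instances in bulk
    MyModel.objects.filter(id__in=list(self.df_values['id'])).delete()
    MyModel.objects.bulk_create(self.df_values['object'].tolist())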

How to retrieve the rows from starting and ending index in firebase database

I am trying pagination feature in my project, I am using react with firebase.
My problem statement: if I have around 50 records in my database, what is the query if we want to select records 3 - 8?
If you only need specific records on the front end, you can return the whole list from the backend and just slice it on the UI side:
let arr = [
  { id: 1, name: 'One' },
  { id: 2, name: 'Two' },
  { id: 3, name: 'Three' },
  { id: 4, name: 'Four' },
  { id: 5, name: 'Five' },
  { id: 6, name: 'Six' },
  { id: 7, name: 'Seven' },
  { id: 8, name: 'Eight' },
  { id: 9, name: 'Nine' },
  { id: 10, name: 'Ten' }
];

let selected = arr.slice(2, 8);
console.log(selected);
Firebase Realtime Database doesn't have numeric index pagination. The API allows you to start at the beginning and page through the query results using startAt(), limitToFirst(), and other related methods, as described in the documentation.
If you absolutely must use indexes, you will have to read the entire set of results, populate an array, and pick out the desired results from that array using the indexes you want on the page.
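For completeness, here is a minimal sketch of cursor-style paging with the namespaced (v8-style) Web SDK; the records path and the page size of 6 are assumptions, not taken from the question:
const pageSize = 6;

// Fetch one extra item and keep its key as the cursor for the next page.
firebase.database()
  .ref('records')
  .orderByKey()
  .limitToFirst(pageSize + 1)
  .once('value')
  .then(snapshot => {
    const items = [];
    snapshot.forEach(child => {
      items.push({ key: child.key, ...child.val() });
    });
    const nextKey = items.length > pageSize ? items.pop().key : null;
    // render `items`; to load the next page, repeat the query with
    // .orderByKey().startAt(nextKey).limitToFirst(pageSize + 1)
  });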

Initializing long[] for DeleteRows() method in Smartsheet API

I'm trying to delete a set of records using the following C# SDK Smartsheet API 2.0 method:
long[] deleteRowIds = existingRowIds.Except(updatedRowIds).ToArray();
smartsheet.SheetResources.RowResources.DeleteRows(sheetId, deleteRowIds, true);
Within the smartsheet documentation the row id parameter example is as follows:
smartsheet.SheetResources.RowResources.DeleteRows( sheetId, new long[] { 207098194749316, 207098194749317 }, true)
I hardcoded the row ids relevant to my sheet and was able to execute the method. However, when I try to pass the array of ids generated in my first line of code, I receive this error: "There was an issue connecting".
I can't find that error in any of their documentation. Is there a chance I'm misunderstanding how my long[] variable is initialized from the List by the ToArray() method?
That's really my only theory (as I've exported all my row ids to ensure I'm not pushing an incorrect data type).
Any help would be greatly appreciated.
Thanks!
Channing
Looks like the DeleteRows bulk operation has a limit on the number of row ids I can pass in the long[] parameter. The limit is somewhere between 400 and 500 row ids. I'll partition the ids in order to stay under the limit.
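A hypothetical sketch of that partitioning, reusing the variables from the question; the chunk size of 400 is an assumption chosen to stay under the observed limit (requires using System.Linq;):
const int chunkSize = 400;
long[] deleteRowIds = existingRowIds.Except(updatedRowIds).ToArray();

// Delete the rows in chunks so each call stays below the bulk-operation limit.
for (int i = 0; i < deleteRowIds.Length; i += chunkSize)
{
    long[] chunk = deleteRowIds.Skip(i).Take(chunkSize).ToArray();
    smartsheet.SheetResources.RowResources.DeleteRows(sheetId, chunk, true);
}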

Get database last N data points from each node (Cloudant/couchdb)

TL;DR: MapReduce or POST request?
What is the correct (i.e. most efficient) way to fetch the latest n data points of multiple sensors, from Cloudant or an equivalent database?
Sensor data is stored in individual documents like this:
{
  "_id": "2d26dbd8e655ae02bdab611afc92b6cf",
  "_rev": "1-a64448521f05935b915e4bee12328e84",
  "date": "2017-06-20T15:59:50.509Z",
  "name": "Sensor01",
  "temperature": 24.5,
  "humidity": 45.3,
  "rssi": -33
}
I want to fetch the latest 10 documents from sensor01-sensor99 so I can feed them to the UI.
I have discovered a few options:
1. Use a MapReduce function
Reduce each sensor's data to an array under sensor01, sensor02, etc.
E.g.
Map:
function (doc) {
  if (doc.name && doc.temperature) emit(doc.name, doc.temperature);
}
Reduce:
function (keys, values, rereduce) {
  var temp_arr = [];
  for (var i = 0; i < values.length; i++) {
    temp_arr.push(values);
  }
  return temp_arr;
}
I couldn't get this to work, but I think the method should be viable.
2. Multi-document fetching
{
  "queries": [
    {sensor01}, {sensor02}, {sensor03}, etc.
  ]
}
Where each {sensor0x} is filtered using
{"startkey": [sensors[i],{}],"endkey": [sensors[i]],"limit": 5}
This way I can order documents using ?descending=true
I implemented it and it works, but I have my doubts about whether I should use this if I have 1000 sensors with 10000 data points each.
And for hundreds of sensors I need to send a very large POST request.
Something better?
Is my architecture even correct?
Storing sensor data in individual documents, and then filling the UI by fetching all the data through the REST API.
Thank you very much!
There's nothing wrong with your method of storing one reading per document, but there's no truly efficient way of getting "the last n data points" for a number of sensors.
We could create a MapReduce function:
function (doc) {
  if (doc.name && doc.temperature && doc.date) {
    emit([doc.name, doc.date], doc.temperature);
  }
}
This creates an index ordered by name and date.
We can access the most recent readings for a single sensor by querying the view:
/_design/mydesigndoc/_view/myview?startkey=["Sensor01","2020-01-01"]&descending=true&limit=10
This fetches readings for "Sensor01" in newest-first order:
startkey & endkey are reversed when doing descending=true
descending=true means in reverse order
limit - the number of readings required (or n in your parlance)
This is a very efficient use of Cloudant/CouchDB, but it only returns the last n readings for a single sensor. To retrieve other sensors' data, additional API calls would be required.
Creating an index like this:
function (doc) {
  if (doc.name && doc.temperature && doc.date) {
    emit(doc.date, doc.temperature);
  }
}
orders each reading by date. You can then retrieve the newest n readings with:
/_design/mydesigndoc/_view/myview?startkey="2020-01-01"&descending=true&limit=200
If all of your sensors are saving data at the same rate, then simply using a larger limit should get you the latest readings of all sensors.
This too is an efficient use of CouchDB/Cloudant.
You may also want to look at the built-in reducers (_count, _sum and _stats) to get the database to aggregate readings for you. They are a great way to create year/month/day groupings of IoT data.
In general, I would recommend not using custom reducers: they are many times less efficient than the built-in reducers, which are written in Erlang.
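To illustrate that last point, here is a hedged sketch of a view that pairs date-structured keys with the built-in _stats reducer; the design document and view names are assumptions:
// Map function: key = [sensor name, year, month, day], value = temperature
function (doc) {
  if (doc.name && doc.temperature && doc.date) {
    var d = new Date(doc.date);
    emit([doc.name, d.getFullYear(), d.getMonth() + 1, d.getDate()], doc.temperature);
  }
}
With the reduce function set to the built-in _stats, a query such as /_design/mydesigndoc/_view/dailystats?group_level=3 returns count/sum/min/max of temperature per sensor per month, and group_level=4 gives per-day statistics.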

MongoDB - Exclude an inner array but include its count

I have a collection called devicesegments, each document of which contains a large array called devices. Owing to the size of the devices array, I've been asked not to include it in my query for a page that lists all devicesegments, but only to include its count. Is this possible?
What I was doing before :
A simple db.devicesegments.find()
What I'm doing now :
db.devicesegments.find({}, { devices : 0 })
What I want to achieve :
db.devicesegments.find({}, { devices : 0, devices.length : 1 })
Something like a COUNT(devices) AS device_count!
Ashkay, there's no way to do this with Mongo currently. As #rompetroll says, your application should keep a "count" field on each document, and carefully $inc it whenever you change the number of entries in the array. Then when you query for the document, exclude the array like:
db.collection.find({}, {devices:0})
If you're willing to run MongoDB 2.1, which is a development release, the aggregation framework provides a means to calculate the array size within a query:
http://www.mongodb.org/display/DOCS/Aggregation+Framework
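For readers on a modern MongoDB release, the aggregation framework referenced above does make this possible; a hedged sketch, assuming MongoDB 3.4+ for $addFields and $size:
db.devicesegments.aggregate([
  // compute the count from the array...
  { $addFields: { device_count: { $size: "$devices" } } },
  // ...then drop the array itself from the output
  { $project: { devices: 0 } }
])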
Since there is currently no way to do this without adding a new device_count field to my collection, the temporary fix I applied was to fetch all the data from the database along with the devices array, add a field for devices.length to each document, and then remove the devices array before sending the data across.
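Roughly, in legacy mongo shell JavaScript, that workaround looks like this (a sketch, not the exact code used):
db.devicesegments.find().map(function (doc) {
  doc.device_count = doc.devices.length; // add the count field
  delete doc.devices;                    // strip the large array before sending
  return doc;
});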
