Nodejs: Batch insert of large number of rows into a database - database

I want to process a large number of records (>400k) in a batch and insert them into a database.
I know how to iterate over an array with for() or underscore.each(), and I also know how to insert a record into various (no)SQL databases asynchronously. That's not the problem - the problem is that I can't figure out a way to do both at the same time.
The specific database doesn't play a role here; the principle applies to any (no)SQL database with an async interface.
I'm looking for a pattern to solve the following problem:
The loop approach:
var results = []; // imagine 100k objects here
_.each(results, function(row) {
    var newObj = prepareMyData(row);
    db.InsertQuery(newObj, function(err, response) {
        if (!err) console.log('YAY, inserted successfully');
    });
});
This approach is obviously flawed: it hammers the database with insert queries without waiting for any of them to finish. With MySQL adapters that use a connection pool, you soon run out of connections and the script fails.
The recursion approach:
var results = []; // again, full of BIGDATA ;)
var index = 0;
var myRecursion = function() {
    var row = results[index];
    var data = prepareMyData(row);
    db.InsertQuery(data, function(err, response) {
        if (!err) {
            console.log('YAY, inserted successfully!');
            index++; // increment for the next recursive call
            if (index < results.length) myRecursion();
        }
    });
};
myRecursion();
While this approach works pretty well for small chunks of data (though it can be slow, but that's OK - the event loop can rest a while, waiting for the query to finish), it doesn't work for large arrays: too many recursions.
I could write a batch insert easily in any other procedural language like PHP or so, but I don't want to. I want to solve this, asynchronously, in nodejs - for educational purposes.
Any suggestions?

I've found a solution that works for me, but I'm still interested in understanding how this technically works.
Reading the node-async docs I found a few functions to achieve this:
async.map //iterates over an array
async.each //iterates over an array in parallel
async.eachSeries //iterates over an array sequentially
async.eachLimit //iterates over an array in parallel with n (limit) parallel calls.
For instance:
var results = []; // still a huge array
// "4" means async will fire the iterator function up to 4 times in parallel
async.eachLimit(results, 4, function(row, cb) {
    var data = prepareMyData(row);
    db.InsertQuery(data, function(err, response) {
        // always invoke the callback, even on error - otherwise eachLimit stalls
        cb(err, response);
    });
}, function(err) {
    console.log("we're done!");
});

Related

Google Apps Script: Use array out of spreadsheet

I'm trying to use Google Apps Script (instead of VBA, which I'm more used to) and have now managed to create a loop over different spreadsheets (not just different sheets within one document) using the forEach function.
(I tried with a for (r=1;r=lastRow; r++) loop but did not manage.)
It works now when I define the array of sheet names manually:
var SheetList = ["17DCu1nyyX4a6zCkkT3RfBSfo-ghoc2fXEX8chlVMv5k", "1rRGQHs_JShPSBIGFCdG6AqXM967JFhdlfQ92cf5ISL8", "1pFDyXgYmvC5gnN5AU5xJ8vGiihwtubcbG2n4LPhPACQ", "1mK_X4Q7ysJQTt8NZoZASBE5zuUllPmmWSJsxu5Dnu9Y", "1FpjIGWTG5_6MMYJF72wvoiBRp_Xlt5BDpzvSZKcsU"]
And then for information the loop:
SheetList.forEach(function(r) {
    var thisSpreadsheet = SpreadsheetApp.openById(r);
    var thisData = thisSpreadsheet.getSheetByName('Actions').getDataRange();
    var values = thisData.getValues();
    var toWorksheet = targetSpreadsheetID.getSheetByName(targetWorksheetName);
    var last = toWorksheet.getLastRow() + 1;
    var toRange = toWorksheet.getRange(last, 1, thisData.getNumRows(), thisData.getNumColumns());
    toRange.setValues(values);
});
Now I want to create the definition of the array "automatically" out of the spreadsheet 'List' where all spreadsheets which I want to loop are listed in column C.
I tried several ideas, but always failed.
Most optimistic ones were:
var SheetList = targetSpreadsheetID.getSheetByName('List').getRange(2,3,lastRow-2,3).getValues()
And I also tried with the array-function:
var sheetList=Array.apply(targetSpreadsheetID.getSheetByName('List').getRange(2,3,lastRow-2,3))
but all without success.
It should normally be possible, in more or less a single line, to import the array from the spreadsheet into Google Apps Script?
I would very much appreciate if someone could please give me a hint where my mistake is.
Thank you very much.
Maria
I still did not manage to build the array as I initially wanted, but I found a workable solution with a for loop, which I want to share here in case someone is looking for a similar solution (and then at least finds my workaround ;) )
for (var i = 2; i < lastRow; i++) {
    // getValue() for a single cell - getValues() would return a 2-D array
    var sheetId = targetSpreadsheetID.getSheetByName('List').getRange(i, 3).getValue();
    Logger.log(sheetId);
    var thisSpreadsheet = SpreadsheetApp.openById(sheetId);
    ... // the rest identical to the loop above...
Don't hesitate to add your comments or advice anyhow, but I will mark the question as closed.
Thanks a lot.
Maria
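One likely reason the one-liner attempts failed: getRange(...).getValues() always returns a two-dimensional array (one inner array per row), so the result can't be passed to openById directly. A hedged sketch of the fix - the SpreadsheetApp call is shown as a comment because it only runs inside Apps Script, and the literal array (with hypothetical IDs) stands in for its result:

```javascript
// getValues() always returns a 2-D array: one inner array per row,
// one entry per column - even for a single-column range.
// Inside Apps Script, the IDs in column C would be read like this:
//   var values = targetSpreadsheetID.getSheetByName('List')
//                    .getRange(2, 3, lastRow - 1, 1).getValues();
// The literal below stands in for that result:
var values = [["id-aaa"], ["id-bbb"], ["id-ccc"]];

// Map each row to its first (and only) cell to get a flat array of IDs,
// which can then be fed to SheetList.forEach(...) as in the question.
var sheetList = values.map(function (row) {
  return row[0];
});

console.log(sheetList); // ["id-aaa", "id-bbb", "id-ccc"]
```

So the "more or less one single line" version is the getValues() call followed by a map over the rows.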

Cypress get length of a table dynamically

I'm trying to test the length of a React table.
In particular, I want to check that after a user adds a new row to the table, its length increases by one.
In my code I can't extract the length value of the table and save it in the variable rowsLength so that it can be checked later - what am I doing wrong?
It seems rowsLength is an object, not a number..
const rowsLength = cy.get(".MuiTableBody-root").find("tr").its("length");
cy.get("#addRowButton").click();//new tr inserted
cy.get(".MuiTableBody-root").find("tr").should("have.length", rowsLength + 1);
Setting a global variable quite often will not work, because of the asynchronous nature of a web app. The docs refer to this as doing a backflip.
The most reliable way to handle this is either with a closure or with an alias.
Closures
To access what each Cypress command yields you use .then().
cy.get(".MuiTableBody-root").find("tr").its('length').then(initialLength => {
    // Code that uses initialLength is nested inside .then()
    cy.get("#addRowButton").click();
    cy.get(".MuiTableBody-root").find("tr").should("have.length", initialLength + 1);
})
Aliases
Aliases allow you to separate the code that creates the variable from the code that uses it.
cy.get(".MuiTableBody-root").find("tr").its('length').as('initialLength');
cy.get("#addRowButton").click();
// other test code
cy.get('@initialLength').then(initialLength => {
    cy.get(".MuiTableBody-root").find("tr").should("have.length", initialLength + 1);
})
Why not a global variable?
Using a global variable is an anti-pattern because it often creates a race condition between the test code and the queued commands.
By luck, your example will work most of the time because cy.get("#addRowButton").click() is a slow operation.
But if you want to test that the row count has not changed, then using a global variable will fail.
Try this
var initialLength = 0;
cy.get('table').find("tr").then((tr) => {
    initialLength = tr.length;
})
// Test that row count remains the same
cy.get('table').find("tr").should("have.length", initialLength);
// fails because it runs before the variable is set
This is because the test code runs much faster than the .get() and .find() commands, which must access the DOM.
Although the above example is trivial, remember that web apps are inherently asynchronous and backflip tests can easily fail.
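The backflip failure can be reproduced outside Cypress with plain JavaScript; getRowCount below is a hypothetical stand-in for a queued cy.get(...).find("tr").its("length") command:

```javascript
// Simulates Cypress's queued commands: the callback fires later,
// while the "test code" below it runs immediately.
let initialLength = 0;

// Hypothetical stand-in for a queued DOM query - resolves asynchronously.
function getRowCount(cb) {
  setTimeout(() => cb(4), 10);
}

getRowCount((count) => {
  initialLength = count; // runs AFTER the synchronous code below
});

// Synchronous test code reads the stale value:
console.log(initialLength); // still 0, not 4

// The reliable pattern: keep dependent code inside the callback (closure).
getRowCount((count) => {
  console.log(count + 1); // 5 - correct, because it waits for the result
});
```

The same timing applies to Cypress: command callbacks run from the queue, so only code nested in .then() (or retrieved via an alias) sees the resolved value.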
With find() you can do something like:
var tableLen = 0;
cy.get(".MuiTableBody-root").find("tr").then((tr) => {
    tableLen = tr.length;
})
and then:
cy.get("#addRowButton").click(); //new tr inserted
cy.get(".MuiTableBody-root").find("tr").should("have.length", tableLen + 1);

Best practices to execute faster a CasperJS script that scrapes thousands of pages

I've written a CasperJS script that works very well except that it takes a (very very) long time to scrape pages.
In a nutshell, here's the pseudo code:
my functions to scrape the elements
my casper.start() to start the navigation and log in
casper.then() where I loop through an array and store my links
casper.thenOpen() to open each link and call my functions to scrape.
It works perfectly (and fast enough) for scraping a bunch of links. But when it comes to thousands (right now I'm running the script with an array of 100K links), the execution time is endless: the first 10K links were scraped in 3h54m10s and the following 10K in 2h18m27s.
I can explain a little bit of the difference between the two 10K batches: the first includes the looping & storage of the array with the 100K links. From that point, the script only opens pages to scrape them. However, I noticed the array was ready to go after roughly 30 minutes, so it doesn't exactly explain the time gap.
I've placed my casper.thenOpen() in the for loop, hoping that after each new link is built and stored in the array, the scraping will happen. Now, I'm sure I've got this wrong, but will it change anything in terms of performance?
That's the only lead I have in mind right now, and I'd be very thankful if anyone is willing to share his/her best practices for significantly reducing the script's running time (shouldn't be hard!).
EDIT #1
Here's my code below:
var casper = require('casper').create();
var fs = require('fs');

// This array maintains a list of links to each HOL profile
// Example of a valid URL: https://myurl.com/list/74832
var root = 'https://myurl.com/list/';
var end = 0;
var limit = 100000;
var scrapedRows = [];

// Returns the selector element's property if the selector exists, otherwise returns defaultValue
function querySelectorGet(selector, property, defaultValue) {
    var item = document.querySelector(selector);
    item = item ? item[property] : defaultValue;
    return item;
}

// Scraping function
function scrapDetails(querySelectorGet) {
    var info1 = querySelectorGet("div.classA h1", 'innerHTML', 'N/A').trim();
    var info2 = querySelectorGet("a.classB span", 'innerHTML', 'N/A').trim();
    var info3 = querySelectorGet("a.classC span", 'innerHTML', 'N/A').trim();
    // For scraping different texts of the same kind (i.e. comments from users)
    var commentsTags = document.querySelectorAll('div.classComments');
    var comments = Array.prototype.map.call(commentsTags, function(e) {
        return e.innerText;
    });
    // Return all the rest of the information as a JSON string
    return {
        info1: info1,
        info2: info2,
        info3: info3,
        // There is no fixed number of comments & answers, so we join them with a semicolon
        comments: comments.join(' ; ')
    };
}

casper.start('http://myurl.com/login', function() {
    this.sendKeys('#username', 'username', {keepFocus: true});
    this.sendKeys('#password', 'password', {keepFocus: true});
    this.sendKeys('#password', casper.page.event.key.Enter, {keepFocus: true});
    // Logged in
    this.wait(3000, function() {
        // Verify connection by printing the welcome page's title
        this.echo('Opened main site titled: ' + this.getTitle());
    });
});

casper.then(function() {
    // Quick summary
    this.echo('# of links : ' + limit);
    this.echo('scraping links ...');
    for (var i = 0; i < limit; i++) {
        // Building the urls to visit
        var link = root + end;
        // Visiting pages...
        casper.thenOpen(link).then(function() {
            // We pass the querySelectorGet method to use it within the webpage context
            var row = this.evaluate(scrapDetails, querySelectorGet);
            scrapedRows.push(row);
            // Stats display
            this.echo('Scraped row ' + scrapedRows.length + ' of ' + limit);
        });
        end++;
    }
});

casper.then(function() {
    fs.write('infos.json', JSON.stringify(scrapedRows), 'w');
});

casper.run(function() {
    casper.exit();
});
At this point I probably have more questions than answers but let's try.
Is there a particular reason why you're using CasperJS and not, for example, curl? I can understand the need for CasperJS if you are going to scrape a site that uses JavaScript, or if you want to take screenshots. Otherwise I would probably use curl along with a scripting language like PHP or Python and take advantage of the built-in DOM parsing functions.
And you can of course use dedicated scraping tools like Scrapy. There are quite a few tools available.
Then the 'obvious' question: do you really need to have arrays that large ? What you are trying to achieve is not clear, I am assuming you will want to store the extracted links to a database or something. Isn't it possible to split the process in small batches ?
One thing that should help is to allocate sufficient memory by declaring a fixed-size array ie:
var theArray = new Array(1000);
Resizing the array constantly is bound to cause performance issues. Every time new items are added to the array, expensive memory allocation operations must take place in the background, and are repeated as the loop is being run.
Since you are not showing any code, we cannot suggest meaningful improvements - just generalities.
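The batching suggestion can be sketched in plain JavaScript: split the link array into fixed-size chunks and process one chunk at a time, so the full 100K array is never held in one working set. processChunk is a hypothetical stand-in for the scraping step:

```javascript
// Split a large array into fixed-size chunks so each batch can be
// scraped (and its results flushed to disk) before the next one starts.
function chunk(array, size) {
  const chunks = [];
  for (let i = 0; i < array.length; i += size) {
    chunks.push(array.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical stand-in for "open each link and scrape it".
function processChunk(links) {
  return links.map((link) => ({ link: link, scraped: true }));
}

const links = ['u1', 'u2', 'u3', 'u4', 'u5'];
chunk(links, 2).forEach((batch, i) => {
  const rows = processChunk(batch);
  console.log('batch ' + i + ': ' + rows.length + ' rows');
});
```

With real scraping, each batch's rows would be appended to the output file before the next batch begins, keeping memory flat regardless of the total link count.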

Action to spawn multiple further actions in Gatling scenario

Background
I'm currently working on a capability analysis set of stress-testing tools for which I'm using gatling.
Part of this involves loading up an elasticsearch with scroll queries followed by update API calls.
What I want to achieve
Step 1: Run the scroll initiator and save the _scroll_id where it can be used by further scroll queries
Step 2: Run a scroll query on repeat; as part of each scroll query, make a modification to each hit returned and index it back into elasticsearch, effectively spawning up to 1000 actions from the one scroll query action, and having the results sampled.
Step 1 is easy. Step 2 not so much.
What I've tried
I'm currently trying to achieve this via a ResponseTransformer that parses the JSON-formatted results, makes modifications to each one, and fires off a thread for each that attempts another exec(http(...).post(...) etc.) to index the changes back into elasticsearch.
Basically, I think I'm going about it the wrong way. The indexing threads never get run, let alone sampled by gatling.
Here's the main body of my scroll query action:
...
val pool = Executors.newFixedThreadPool(parallelism)

val query = exec(http("Scroll Query")
  .get(s"/_search/scroll")
  .body(ElFileBody("queries/scrollquery.json")).asJSON // Do the scroll query
  .check(jsonPath("$._scroll_id").saveAs("scroll_id")) // Get the scroll ID from the response
  .transformResponse { case response if response.isReceived =>
    new ResponseWrapper(response) {
      val responseJson = JSON.parseFull(response.body.string)
      // Get the hits
      val hits = responseJson.get.asInstanceOf[Map[String, Any]]("hits").asInstanceOf[Map[String, Any]]("hits").asInstanceOf[List[Map[String, Any]]]
      for (hit <- hits) {
        val id = hit.get("_id").get.asInstanceOf[String]
        val immutableSource = hit.get("_source").get.asInstanceOf[Map[String, Any]]
        val source = collection.mutable.Map(immutableSource.toSeq: _*) // Make the map mutable
        source("newfield") = "testvalue" // Make a modification
        Thread.sleep(pause) // Pause to simulate topology throughput
        pool.execute(new DocumentIndexer(index, doctype, id, source)) // Create a new thread that executes the index request
      }
    }
  }) // Make some mods and re-index into elasticsearch
...
DocumentIndexer looks like this:
class DocumentIndexer(index: String, doctype: String, id: String, source: scala.collection.mutable.Map[String, Any]) extends Runnable {
  ...
  val httpConf = http
    .baseURL(s"http://$host:$port/${index}/${doctype}/${id}")
    .acceptHeader("application/json")
    .doNotTrackHeader("1")
    .disableWarmUp

  override def run() {
    val json = new ObjectMapper().writeValueAsString(source)
    exec(http(s"Index ${id}")
      .post("/_update")
      .body(StringBody(json)).asJSON)
  }
}
Questions
Is this even possible using gatling?
How can I achieve what I want to achieve?
Thanks for any help/suggestions!
It's possible to achieve this by using jsonPath to extract the JSON hit array and save the elements into the session; then, using a foreach in the action chain and exec-ing the index task inside the loop, you can perform the indexing accordingly.
ie:
ScrollQuery
...
val query = exec(http("Scroll Query")
  .get(s"/_search/scroll")
  .body(ElFileBody("queries/scrollquery.json")).asJSON // Do the scroll query
  .check(jsonPath("$._scroll_id").saveAs("scroll_id")) // Get the scroll ID from the response
  .check(jsonPath("$.hits.hits[*]").ofType[Map[String, Any]].findAll.saveAs("hitsJson")) // Save a List of hit Maps into the session
)
...
Simulation
...
val scrollQueries = scenario("Enrichment Topologies").exec(ScrollQueryInitiator.query, repeat(numberOfPagesToScrollThrough, "scrollQueryCounter") {
  exec(ScrollQuery.query, pause(10 seconds).foreach("${hitsJson}", "hit") { exec(HitProcessor.query) })
})
...
HitProcessor
...
def getBody(session: Session): String = {
  val hit = session("hit").as[Map[String, Any]]
  val source = mapAsScalaMap(hit("_source").asInstanceOf[java.util.LinkedHashMap[String, Any]])
  source.put("newfield", "testvalue")
  val sourceJson = new ObjectMapper().writeValueAsString(mapAsJavaMap(source))
  s"""{"doc":${sourceJson}}"""
}

def getId(session: Session): String = {
  val hit = session("hit").as[Map[String, Any]]
  val id = URLEncoder.encode(hit("_id").asInstanceOf[String], "UTF-8")
  s"/${index}/${doctype}/${id}/_update"
}

val query = exec(http(s"Index Item")
  .post(session => getId(session))
  .body(StringBody(session => getBody(session))).asJSON)
...
Disclaimer: This code still needs optimising! And I haven't actually learnt much scala yet. Feel free to comment with better solutions
Having done this, what I really want to achieve now is to parallelise a given number of the indexing tasks, i.e. I get 1000 hits back and want to execute an index task for each individual hit, but rather than iterating over them and doing them one after another, I want to do 10 at a time concurrently.
However, I think this is a separate question, really, so I'll present it as such.

IndexedDb dynamically populating existing ObjectStores

I have created many, let's say three ObjectStores inside a defined and versioned IndexedDB schema.
I need to populate all of them. To do so, I created an object which stores both the name and the endpoint (from which it gets the data for populating).
Also, to avoid errors when trying to fill object stores that are already populated, I use the count() method to count the keys inside the object store: if there are 0 keys it populates, otherwise it reads.
It works perfectly if I execute on a one-by-one basis, that is, instead of using a loop, I declare and execute each of the three object stores separately.
However, when invoking the function to populate each store inside a loop, I get the following error message for the last two object stores to be populated:
Failed to read the 'result' property from 'IDBRequest': The request has not finished. at IDBRequest.counts.(anonymous function).onsuccess
Here is the code:
// object contains object stores and endpoints.
const stores = [
    {osName: 'user-1', osEndPoint: '/api/1'},
    {osName: 'user-2', osEndPoint: '/api/2'},
    {osName: 'user-3', osEndPoint: '/api/3'}
];
// open db.
var request = indexedDB.open(DB_NAME, DB_VERSION);
// in order to dynamically create vars, instantiate three arrays.
var tx = [];
var counts = [];
var total = [];
// onsuccess callback.
request.onsuccess = function (e) {
    db = this.result;
    for (k in stores) {
        tx[k] = db.transaction(stores[k].osName).objectStore(stores[k].osName);
        counts[k] = tx[k].count();
        counts[k].onsuccess = function(e) {
            total[k] = e.target.result;
            // if the counting result equals 0, then populate by calling a function that does so.
            if (total[k] == 0) {
                fetchGet2(stores[k].osEndPoint, popTable, stores[k].osName);
            } else {
                readData(DB_NAME, DB_VERSION, stores[0].osName);
            }
        };
    } // closes for loop
}; // closes request.onsuccess.
The fetchGet2 function works well inside a loop, for example the loop used to create the objectstores, and also has been tested on a one by one basis.
It looks like an async issue; however, I cannot figure out how to fix the problem, which is to populate existing object stores dynamically, skipping the filled ones and only filling the empty ones.
Indeed, testing without the count but inside the loop works perfectly, as does the count without the loop.
At the moment, when logging counts[k], it only logs data for the last member of the object.
Thanks in advance, I'm coding with vanilla js, and I'm not interested in using any framework at all.
Yes, this looks like an issue with async code. For loops iterate synchronously. Try writing a loop that does not advance k until each request completes.
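The "only logs data for the last member" symptom is also the classic var-in-loop capture problem: every onsuccess callback closes over the same loop variable, which has already reached its final value by the time the asynchronous callbacks fire. A minimal sketch outside IndexedDB, with setTimeout standing in for the async count() request:

```javascript
const stores = ['user-1', 'user-2', 'user-3'];
const varResults = [];
const letResults = [];

// Broken: var is function-scoped, so by the time the async callbacks run,
// every one of them sees the loop's final value of i (which is 3).
for (var i = 0; i < stores.length; i++) {
  setTimeout(function () {
    varResults.push(stores[i]); // stores[3] -> undefined, three times
  }, 0);
}

// Fixed: let creates a fresh binding of j for each iteration,
// so each callback captures the value it was created with.
for (let j = 0; j < stores.length; j++) {
  setTimeout(function () {
    letResults.push(stores[j]); // user-1, user-2, user-3
  }, 0);
}

setTimeout(function () {
  console.log('var saw:', varResults); // [ undefined, undefined, undefined ]
  console.log('let saw:', letResults); // [ 'user-1', 'user-2', 'user-3' ]
}, 20);
```

Switching the for...in loop to for (let k in stores), or iterating with stores.forEach(...), gives each onsuccess callback its own binding of k.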
