Best practices to speed up a CasperJS script that scrapes thousands of pages - arrays

I've written a CasperJS script that works very well, except that it takes a very long time to scrape pages.
In a nutshell, here's the pseudocode:
my functions to scrape the elements
my casper.start() to start the navigation and log in
casper.then() where I loop through an array and store my links
casper.thenOpen() to open each link and call my functions to scrape it.
It works perfectly (and fast enough) for scraping a handful of links. But when it comes to thousands (right now I'm running the script with an array of 100K links), the execution time is endless: the first 10K links were scraped in 3h54m10s and the following 10K in 2h18m27s.
I can partly explain the difference between the two 10K batches: the first includes looping over and storing the array of 100K links. From that point on, the script only opens pages to scrape them. However, I noticed the array was ready after roughly 30 minutes, so that doesn't fully explain the time gap.
I've placed my casper.thenOpen() in the for loop, hoping that the scraping would happen right after each new link is built and stored in the array. I'm now sure that isn't what happens, but would changing it make any difference in terms of performance?
That's the only lead I have right now, and I'd be very thankful if anyone is willing to share best practices to significantly reduce the script's running time (shouldn't be hard!).
EDIT #1
Here's my code below:
var casper = require('casper').create();
var fs = require('fs');

// This array maintains a list of links to each HOL profile
// Example of a valid URL: https://myurl.com/list/74832
var root = 'https://myurl.com/list/';
var end = 0;
var limit = 100000;
var scrapedRows = [];

// Returns the selector element property if the selector exists but otherwise returns defaultValue
function querySelectorGet(selector, property, defaultValue) {
    var item = document.querySelector(selector);
    item = item ? item[property] : defaultValue;
    return item;
}

// Scraping function
function scrapDetails(querySelectorGet) {
    var info1 = querySelectorGet("div.classA h1", 'innerHTML', 'N/A').trim();
    var info2 = querySelectorGet("a.classB span", 'innerHTML', 'N/A').trim();
    var info3 = querySelectorGet("a.classC span", 'innerHTML', 'N/A').trim();

    // For scraping different texts of the same kind (i.e. comments from users)
    var commentsTags = document.querySelectorAll('div.classComments');
    var comments = Array.prototype.map.call(commentsTags, function(e) {
        return e.innerText;
    });

    // Return all the rest of the information as a JSON string
    return {
        info1: info1,
        info2: info2,
        info3: info3,
        // There is no fixed number of comments & answers so we join them with a semicolon
        comments: comments.join(' ; ')
    };
}

casper.start('http://myurl.com/login', function() {
    this.sendKeys('#username', 'username', {keepFocus: true});
    this.sendKeys('#password', 'password', {keepFocus: true});
    this.sendKeys('#password', casper.page.event.key.Enter, {keepFocus: true});
    // Logged In
    this.wait(3000, function() {
        // Verify connection by printing welcome page's title
        this.echo('Opened main site titled: ' + this.getTitle());
    });
});

casper.then(function() {
    // Quick summary
    this.echo('# of links : ' + limit);
    this.echo('scraping links ...');
    for (var i = 0; i < limit; i++) {
        // Building the urls to visit
        var link = root + end;
        // Visiting pages...
        casper.thenOpen(link).then(function() {
            // We pass the querySelectorGet method to use it within the webpage context
            var row = this.evaluate(scrapDetails, querySelectorGet);
            scrapedRows.push(row);
            // Stats display
            this.echo('Scraped row ' + scrapedRows.length + ' of ' + limit);
        });
        end++;
    }
});

casper.then(function() {
    fs.write('infos.json', JSON.stringify(scrapedRows), 'w');
});

casper.run(function() {
    casper.exit();
});

At this point I probably have more questions than answers but let's try.
Is there a particular reason why you're using CasperJS and not curl, for example? I can understand the need for CasperJS if you are scraping a site that relies on JavaScript, or if you want to take screenshots. Otherwise I would probably use curl along with a scripting language like PHP or Python and take advantage of the built-in DOM parsing functions.
And you can of course use dedicated scraping tools like Scrapy. There are quite a few tools available.
Then the 'obvious' question: do you really need arrays that large? What you are trying to achieve is not clear; I am assuming you will want to store the extracted links in a database or something similar. Isn't it possible to split the process into small batches?
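As a rough illustration of the batching idea, here is a minimal sketch in plain JavaScript. The helper names buildLinks and chunk, and the batch size of 1000, are assumptions made for this example and are not part of the original script; each batch could then be handled by a separate run of the scraper and written to its own output file.

```javascript
// Hypothetical helper: build the list of URLs to visit from a root and an ID range.
function buildLinks(root, start, count) {
  var links = [];
  for (var i = 0; i < count; i++) {
    links.push(root + (start + i));
  }
  return links;
}

// Hypothetical helper: split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  if (size <= 0) throw new Error('size must be positive');
  var batches = [];
  for (var i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Example: 2500 links split into batches of 1000 (the numbers are arbitrary).
var links = buildLinks('https://myurl.com/list/', 0, 2500);
var batches = chunk(links, 1000);
console.log(batches.length + ' batches, last one has ' +
  batches[batches.length - 1].length + ' links');
```

Keeping each run short also means that a crash only loses one batch of results instead of the whole array.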
One thing that may help is to allocate sufficient memory up front by declaring a fixed-size array, i.e.:
var theArray = new Array(1000);
Resizing the array constantly is bound to cause performance issues. Every time new items are added to the array, expensive memory allocation operations must take place in the background, and are repeated as the loop is being run.
Since you are not showing any code, we can only offer generalities rather than meaningful improvements.

Related

Google Apps Script: Use array out of spreadsheet

I am trying to use Google Apps Script (instead of VBA, which I am more used to) and have now managed to loop over different spreadsheets (not only different sheets in one document) using the forEach function.
(I tried with a for (r=1;r=lastRow; r++) but I did not manage.)
It is working now with the array of spreadsheet IDs defined manually:
var SheetList = ["17DCu1nyyX4a6zCkkT3RfBSfo-ghoc2fXEX8chlVMv5k", "1rRGQHs_JShPSBIGFCdG6AqXM967JFhdlfQ92cf5ISL8", "1pFDyXgYmvC5gnN5AU5xJ8vGiihwtubcbG2n4LPhPACQ", "1mK_X4Q7ysJQTt8NZoZASBE5zuUllPmmWSJsxu5Dnu9Y", "1FpjIGWTG5_6MMYJF72wvoiBRp_Xlt5BDpzvSZKcsU"]
And, for information, the loop:
SheetList.forEach(function(r) {
  var thisSpreadsheet = SpreadsheetApp.openById(r);
  var thisData = thisSpreadsheet.getSheetByName('Actions').getDataRange();
  var values = thisData.getValues();
  var toWorksheet = targetSpreadsheetID.getSheetByName(targetWorksheetName);
  var last = toWorksheet.getLastRow() + 1;
  var toRange = toWorksheet.getRange(last, 1, thisData.getNumRows(), thisData.getNumColumns());
  toRange.setValues(values);
});
Now I want to build the array "automatically" from the sheet 'List', where all the spreadsheets I want to loop over are listed in column C.
I tried several ideas, but always failed.
The most promising ones were:
var SheetList = targetSpreadsheetID.getSheetByName('List').getRange(2,3,lastRow-2,3).getValues()
And I also tried with the array-function:
var sheetList=Array.apply(targetSpreadsheetID.getSheetByName('List').getRange(2,3,lastRow-2,3))
but all without success.
Shouldn't it normally be possible to import the array from the spreadsheet into Google Apps Script in more or less a single line?
I would very much appreciate it if someone could give me a hint as to where my mistake is.
Thank you very much.
Maria
I still did not manage to build the array as I initially wanted, but I found a workable solution with a for loop, which I want to share here in case someone is looking for something similar (and then at least finds my workaround ;) )
for (i=2; i<lastRow; i++) {
  // getValue() returns the content of the single cell (the spreadsheet ID) rather than a 2D array
  var SheetList = targetSpreadsheetID.getSheetByName('List').getRange(i,3).getValue();
  Logger.log(SheetList);
  var thisSpreadsheet = SpreadsheetApp.openById(SheetList);
  ... // the rest identical to loop above...
}
Don't hesitate to add your comments or advice anyhow, but I will mark the question as closed.
Thanks a lot.
Maria

Cypress get length of a table dynamically

I'm trying to test the length of a React table.
In particular, I want to check that after a user adds a new row to the table, its length increases by one.
In my code I can't extract the length of the table and save it in the variable rowsLength so that it can be checked later. What am I doing wrong?
It seems rowsLength is an object, not a number.
const rowsLength = cy.get(".MuiTableBody-root").find("tr").its("length");
cy.get("#addRowButton").click();//new tr inserted
cy.get(".MuiTableBody-root").find("tr").should("have.length", rowsLength + 1);
Setting a global variable quite often will not work, because of the asynchronous nature of a web app. The docs refer to this as doing a backflip.
The most reliable way to handle this is either with a closure or with an alias.
Closures
To access what each Cypress command yields you use .then().
cy.get(".MuiTableBody-root").find("tr").its('length').then(initialLength => {
  // Code that uses initialLength is nested inside `.then()`
  cy.get("#addRowButton").click();
  cy.get(".MuiTableBody-root").find("tr").should("have.length", initialLength + 1);
})
Aliases
An alias allows you to separate the code that creates the variable from the code that uses it. Note that an alias is read back with the @ prefix.
cy.get(".MuiTableBody-root").find("tr").its('length').as('initialLength');
cy.get("#addRowButton").click();
// other test code
cy.get('@initialLength').then(initialLength => {
  cy.get(".MuiTableBody-root").find("tr").should("have.length", initialLength + 1);
})
Why not a global variable?
Using a global variable is an anti-pattern because it often creates a race condition between the test code and the queued commands.
By luck, your example will work most of the time because cy.get("#addRowButton").click() is a slow operation.
But if you want to test that the row count has not changed, then using a global variable will fail.
Try this
var initialLength = 0;
cy.get('table').find("tr").then((tr) => {
  initialLength = tr.length;
})
// Test that row count remains the same
cy.get('table').find("tr").should("have.length", initialLength);
// fails because it runs before the variable is set
This is because the test code runs much faster than the .get() and .find() commands, which must access the DOM.
Although the above example is trivial, remember that web apps are inherently asynchronous and backflip tests can easily fail.
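To make the timing problem concrete, here is a toy model of a command queue in plain JavaScript. It is not Cypress and uses none of its API; the names enqueue, runQueue, rowCount and seen are made up for the illustration. It only shows that an argument computed while the commands are being queued captures the variable's old value, whereas code that reads the variable inside a queued callback sees the updated one.

```javascript
// Toy model (NOT Cypress): calls only enqueue work, and the queued work runs
// later, after the test body has finished executing.
var queue = [];
function enqueue(fn) { queue.push(fn); }
function runQueue() { queue.forEach(function (fn) { fn(); }); }

var rowCount = 0;
var seen = [];

// Like .then(tr => { rowCount = tr.length }): the assignment is only queued.
enqueue(function () { rowCount = 3; });

// Like .should('have.length', rowCount): the argument is evaluated right now,
// while rowCount is still 0.
var expected = rowCount;
enqueue(function () { seen.push(expected); });

// Reading the variable inside the queued callback sees the updated value.
enqueue(function () { seen.push(rowCount); });

runQueue();
console.log(seen);
```

This is why nesting the assertion inside .then(), or reading an alias, is the safer pattern: the value is looked up only when the queued step actually runs.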
With find() you can do something like:
var tableLen = 0;
cy.get(".MuiTableBody-root").find("tr").then((tr) => {
  tableLen = tr.length;
})
and then, inside a .then() callback so that tableLen is read only after it has been set:
cy.get("#addRowButton").click().then(() => { //new tr inserted
  cy.get(".MuiTableBody-root").find("tr").should("have.length", tableLen + 1);
})

IndexedDb dynamically populating existing ObjectStores

I have created several (let's say three) object stores inside a defined and versioned IndexedDB schema.
I need to populate all of them. To do so, I created an object which stores both the name and the endpoint (where it gets the data to populate the store).
Also, to avoid errors when trying to fill object stores that are already populated, I use the count() method to count the keys inside the object store: if there are 0 keys it populates, otherwise it reads.
It works perfectly if I execute on a one-by-one basis, that is, instead of using a loop, I declare and execute each of the three object stores separately.
However, when invoking the function to populate each store inside a loop, I get the following error message for the last two object stores to be populated:
Failed to read the 'result' property from 'IDBRequest': The request
has not finished. at IDBRequest.counts.(anonymous function).onsuccess
Here is the code:
// array containing object store names and endpoints.
const stores = [
  {osName:'user-1', osEndPoint:'/api/1'},
  {osName:'user-2', osEndPoint:'/api/2'},
  {osName:'user-3', osEndPoint:'/api/3'}
];
// open db.
var request = indexedDB.open(DB_NAME, DB_VERSION);
// in order to dynamically create vars, instantiate three arrays.
var tx = [];
var counts = [];
var total = [];
// onsuccess callback.
request.onsuccess = function (e) {
  db = this.result;
  for(k in stores) {
    tx[k] = db.transaction(stores[k].osName).objectStore(stores[k].osName);
    counts[k] = tx[k].count();
    counts[k].onsuccess = function(e) {
      total[k] = e.target.result;
      // if the counting result equals 0, then populate by calling a function that does so.
      if (total[k] == 0) {
        fetchGet2(stores[k].osEndPoint, popTable, stores[k].osName); //
      } else {
        readData(DB_NAME, DB_VERSION, stores[0].osName);
      }
    };
  } // closes for loop
}; // closes request.onsuccess.
The fetchGet2 function works well inside a loop, for example the loop used to create the object stores, and has also been tested on a one-by-one basis.
It looks like an async issue; however, I cannot figure out how to fix it. The goal is to populate existing object stores dynamically, filling only the empty ones and leaving the already-populated ones alone.
Indeed, testing inside the loop without the count works perfectly, as does the count without the loop.
At the moment, when logging counts[k], it only logs data for the last member of the object.
Thanks in advance. I'm coding with vanilla JS, and I'm not interested in using any framework at all.
Yes, this looks like an async issue. A for loop iterates synchronously, so by the time the onsuccess callbacks fire, k already holds its final value. Try writing a loop that does not advance to the next store until the current request completes.
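One way to write such a loop is sketched below, assuming the count request is wrapped in a Promise. The function fakeCount is a hypothetical stand-in that resolves after a short timer so the example can run outside a browser; in real code it would wrap the object store's count() call and resolve from its onsuccess handler. checkStores is likewise a made-up name, and it only records which action it would take instead of calling fetchGet2 or readData.

```javascript
// Hypothetical stand-in for an IndexedDB count request wrapped in a Promise.
// The sizes are invented for the example.
function fakeCount(storeName) {
  var sizes = { 'user-1': 0, 'user-2': 5, 'user-3': 0 };
  return new Promise(function (resolve) {
    setTimeout(function () { resolve(sizes[storeName]); }, 10);
  });
}

// Visit the stores one at a time: the next count starts only after the
// previous one has resolved. `const store` is scoped to each iteration, so
// every pass keeps its own store instead of sharing one loop variable.
async function checkStores(stores, countFn) {
  var actions = [];
  for (const store of stores) {
    const total = await countFn(store.osName);
    actions.push(store.osName + (total === 0 ? ': populate' : ': read'));
  }
  return actions;
}

var stores = [
  { osName: 'user-1', osEndPoint: '/api/1' },
  { osName: 'user-2', osEndPoint: '/api/2' },
  { osName: 'user-3', osEndPoint: '/api/3' }
];

checkStores(stores, fakeCount).then(function (actions) {
  console.log(actions.join(', '));
});
```

The same effect can be had without async/await by starting the next request from inside the previous request's success callback.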

Nodejs: Batch insert of large number of rows into a database

I want to process a large number of records (>400k) in a batch and insert them into a database.
I know how to iterate over an array with for() or underscore.each() and also I know how to insert a record into various (no)SQL databases asynchronously. That's not the problem - the problem is I can't figure a way to do both at the same time.
The database distribution itself doesn't play a role here, the principle applies for any (NO)SQL database with an async interface.
I'm looking for a pattern to solve the following problem:
The loop approach:
var results = []; //imagine 100k objects here
_.each(results,function(row){
var newObj = prepareMyData(row);
db.InsertQuery(newObj,function(err,response) {
if(!err) console.log('YAY, inserted successfully');
});
});
This approach is obviously flawed. It kinda hammers the database with insert queries without waiting for a single one to finish. Speaking about MySQL adapters using a connection pool, you pretty soon run out of connections and the script fails.
The recursion approach:
var results = []; //again, full of BIGDATA ;)
var index = 0;
var myRecursion = function()
{
    var row = results[index];
    var data = prepareMyData(row);
    db.InsertQuery(data,function(err, response)
    {
        if (!err)
        {
            console.log('YAY, inserted successfully!');
            index++; //increment for the next recursive call of:
            if (index < results.length) myRecursion();
        }
    });
};
myRecursion();
While this approach works pretty well for small chunks of data (it can be slow, but that's OK: the event loop can rest a while, waiting for the query to finish), it doesn't work for large arrays - too many recursions.
I could write a batch insert easily in any other procedural language like PHP or so, but I don't want to. I want to solve this, asynchronously, in nodejs - for educational purposes.
Any suggestions?
I've found a solution that works for me, but I'm still interested in understanding how this technically works.
Reading the node-async docs I found a few functions to achieve this:
async.map //iterates over an array
async.each //iterates over an array in parallel
async.eachSeries //iterates over an array sequentially
async.eachLimit //iterates over an array in parallel with n (limit) parallel calls.
For instance:
var results = []; //still huge array
// "4" means, async will fire the iterator function up to 4 times in parallel
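async.eachLimit(results,4,function(row,cb){
    var data = prepareMyData(row);
    db.InsertQuery(data,function(err, response)
    {
        // always call cb, passing any error, so async knows this item has finished
        cb(err);
    });
},function(err)
{
    if (err) console.log('an insert failed: ' + err);
    else console.log("we're done!");
});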
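As for how a limit-based iterator can work technically, the following is a simplified sketch, not the async library's actual implementation. The names eachLimitSketch and fakeInsert are hypothetical, and fakeInsert uses a timer in place of a real database call. The idea is to keep a counter of calls in flight, start new ones only while that counter is below the limit, and let each completion callback trigger the next launch.

```javascript
// Simplified sketch of a concurrency-limited iterator (not the async library's code).
// Assumes the iterator calls its callback asynchronously.
function eachLimitSketch(items, limit, iterator, done) {
  var next = 0;         // index of the next item to start
  var running = 0;      // number of iterator calls currently in flight
  var finished = false; // guards against calling done more than once

  if (items.length === 0) return done(null);

  function launch() {
    // Start new calls until the limit is reached or the items run out.
    while (!finished && running < limit && next < items.length) {
      running++;
      iterator(items[next++], function (err) {
        running--;
        if (finished) return;
        if (err) { finished = true; return done(err); }
        if (next >= items.length && running === 0) {
          finished = true;
          return done(null);
        }
        launch(); // a slot has been freed, so try to start the next item
      });
    }
  }
  launch();
}

// Hypothetical stand-in for an asynchronous database insert.
var active = 0, maxActive = 0, inserted = [];
function fakeInsert(row, cb) {
  active++;
  maxActive = Math.max(maxActive, active);
  setTimeout(function () {
    active--;
    inserted.push(row);
    cb(null);
  }, 5);
}

var rows = [];
for (var i = 0; i < 20; i++) rows.push(i);

eachLimitSketch(rows, 4, fakeInsert, function (err) {
  console.log(err ? 'failed: ' + err : 'done, peak concurrency ' + maxActive);
});
```

Because each new call is started from a callback that fires on a later turn of the event loop, the call stack does not grow with the size of the array.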

flex 3: Can anybody help me optimize this array -> arrayCollection function?

I'm using a parent to pass a multi-dimensional array to a child. Structure of the array, named projectPositions is as follows (with example data):
projectPositions[0][0] = 1;
projectPositions[0][1] = 5;
projectPositions[0][2] = '1AD';
projectPositions[0][3] = 'User name';
I need to take this inherited array and turn it into an arrayCollection so that I can use it as a dataProvider. Currently, my init function (which runs onCreationComplete) has this code in it to handle this task of array -> arrayCollection:
for (var i:int = 0; i < projectPositions.length; i++)
{
tempObject = new Object;
tempObject.startOffset = projectPositions[i][0];
tempObject.numDays = projectPositions[i][1];
tempObject.role = projectPositions[i][2];
tempObject.student = projectPositions[i][3];
positionsAC.addItemAt(tempObject, positionsAC.length);
}
Then, during a repeater, I use positionsAC as the dataprovider and reference the items in the following way:
<mx:Repeater id="indPositions" dataProvider="{positionsAC}" startingIndex="0" count="{projectPositions.length}">
<components:block id="thisBlock" offSet="{indPositions.currentItem.startOffset}" numDays="{indPositions.currentItem.numDays}" position="{indPositions.currentItem.role}" sName="{indPositions.currentItem.student}" />
</mx:Repeater>
This all works fine and returns the desired effect, but the load time of this application is around 10 seconds. I'm 99% sure that the load time is caused by the array -> arrayCollection for loop. Is there an easier way to achieve the desired effect without having to wait so long for the page to load?
The issue you're having loading items could be because you are using a Repeater instead of a list class.
With a Repeater, a block is created in memory and drawn on the screen for every item. So, if you have 100 items in your array, 100 blocks will be created. This could slow down both initial creation and the overall app.
A list-based class uses a technique called renderer recycling, which means only the displayed elements are created and rendered on the screen. So, depending on settings, you'd usually have 7-10 'block' instances on the screen, no matter how many items you have in your array.
Change
positionsAC.addItemAt(tempObject, positionsAC.length);
to
positionsAC.addItem(tempObject);
addItemAt is causing a reindex of the collection, which can greatly slow it down.
[EDIT]
Put this trace statement before and after the loop, then subtract one output from the other; the difference is how many milliseconds the loop ran.
var date:Date = new Date( );
trace( date.getTime());
