Background
I'm currently working on a capability analysis set of stress-testing tools for which I'm using gatling.
Part of this involves loading up an elasticsearch with scroll queries followed by update API calls.
What I want to achieve
Step 1: Run the scroll initiator and save the _scroll_id where it can be used by further scroll queries
Step 2: Run a scroll query on repeat, as part of each scroll query make a modification to each hit returned and index it back into elasticsearch, effectively spawning up to 1000 Actions from the one scroll query action, and having the results sampled.
Step 1 is easy. Step 2 not so much.
What I've tried
I'm currently trying to achieve this via a ResponseTransformer that parses JSON-formatted results, makes modifications to each one and fires off a thread for each one that attempts another exec(http(...).post(...) etc) to index the changes back into elasticsearch.
Basically, I think I'm going about it the wrong way about it. The indexing threads never get run, let alone sampled by gatling.
Here's the main body of my scroll query action:
...
val pool = Executors.newFixedThreadPool(parallelism)
val query = exec(http("Scroll Query")
.get(s"/_search/scroll")
.body(ElFileBody("queries/scrollquery.json")).asJSON // Do the scroll query
.check(jsonPath("$._scroll_id").saveAs("scroll_id")) // Get the scroll ID from the response
.transformResponse { case response if response.isReceived =>
new ResponseWrapper(response) {
val responseJson = JSON.parseFull(response.body.string)
// Get the hits and
val hits = responseJson.get.asInstanceOf[Map[String, Any]]("hits").asInstanceOf[Map[String,Any]]("hits").asInstanceOf[List[Map[String, Any]]]
for (hit <- hits) {
val id = hit.get("_id").get.asInstanceOf[String]
val immutableSource = hit.get("_source").get.asInstanceOf[Map[String, Any]]
val source = collection.mutable.Map(immutableSource.toSeq: _*) // Make the map mutable
source("newfield") = "testvalue" // Make a modification
Thread.sleep(pause) // Pause to simulate topology throughput
pool.execute(new DocumentIndexer(index, doctype, id, source)) // Create a new thread that executes the index request
}
}
}) // Make some mods and re-index into elasticsearch
...
DocumentIndexer looks like this:
class DocumentIndexer(index: String, doctype: String, id: String, source: scala.collection.mutable.Map[String, Any]) extends Runnable {
...
val httpConf = http
.baseURL(s"http://$host:$port/${index}/${doctype}/${id}")
.acceptHeader("application/json")
.doNotTrackHeader("1")
.disableWarmUp
override def run() {
val json = new ObjectMapper().writeValueAsString(source)
exec(http(s"Index ${id}")
.post("/_update")
.body(StringBody(json)).asJSON)
}
}
Questions
Is this even possible using gatling?
How can I achieve what I want to achieve?
Thanks for any help/suggestions!
It's possible to achieve this by using jsonPath to extract the JSON hit array and saving the elements into the session and then, using a foreach in the action chain and exec-ing the index task in the loop you can perform the indexing accordingly.
ie:
ScrollQuery
...
val query = exec(http("Scroll Query")
.get(s"/_search/scroll")
.body(ElFileBody("queries/scrollquery.json")).asJSON // Do the scroll query
.check(jsonPath("$._scroll_id").saveAs("scroll_id")) // Get the scroll ID from the response
.check(jsonPath("$.hits.hits[*]").ofType[Map[String,Any]].findAll.saveAs("hitsJson")) // Save a List of hit Maps into the session
)
...
Simulation
...
val scrollQueries = scenario("Enrichment Topologies").exec(ScrollQueryInitiator.query, repeat(numberOfPagesToScrollThrough, "scrollQueryCounter"){
exec(ScrollQuery.query, pause(10 seconds).foreach("${hitsJson}", "hit"){ exec(HitProcessor.query) })
})
...
HitProcessor
...
def getBody(session: Session): String = {
val hit = session("hit").as[Map[String,Any]]
val id = hit("_id").asInstanceOf[String]
val source = mapAsScalaMap(hit("_source").asInstanceOf[java.util.LinkedHashMap[String,Any]])
source.put("newfield", "testvalue")
val sourceJson = new ObjectMapper().writeValueAsString(mapAsJavaMap(source))
val json = s"""{"doc":${sourceJson}}"""
json
}
def getId(session: Session): String = {
val hit = session("hit").as[Map[String,Any]]
val id = URLEncoder.encode(hit("_id").asInstanceOf[String], "UTF-8")
val uri = s"/${index}/${doctype}/${id}/_update"
uri
}
val query = exec(http(s"Index Item")
.post(session => getId(session))
.body(StringBody(session => getBody(session))).asJSON)
...
Disclaimer: This code still needs optimising! And I haven't actually learnt much scala yet. Feel free to comment with better solutions
Having done this, what I really want to achieve now is to parallelise a given number of the indexing tasks. ie: I get 1000 hits back, I want to execute an index task for each individual hit, but rather than just iterating over them and doing them one after another, I want to do 10 at a time concurrently.
However, I think this is a separate question, really, so I'll present it as such.
Related
I've written a CasperJS script that works very well except that it takes a (very very) long time to scrape pages.
In a nutshell, here's the pseudo code:
my functions to scrape the elements
my casper.start() to start the navigation and log in
casper.then() where I loop through an array and store my links
casper.thenOpen() to open each link and call my functions to scrap.
It works perfectly (and fast enough) for scraping a bunch of links. But when it comes to thousands (right now I'm running the script with an array of 100K links), the execution time is endless: the first 10K links have been scrapped in 3h54m10s and the following 10K in 2h18m27s.
I can explain a little bit the difference between the two 10K batches : the first includes the looping & storage of the array with the 100K links. From this point, the scripts only open pages to scrap them. However, I noticed the array was ready to go after roughly 30 minutes so it doesn't explain exactly the time gap.
I've placed my casper.thenOpen() in the for loop hoping that after each new link built and stored in the array, the scrapping will happen. Now, I'm sure I've failed this but will it change anything in terms of performance ?
That's the only lead I have in mind right now and I'd be very thankful if anyone is willing to share his/her best practices to reduce significantly the running time of the script's execution (shouldn't be hard!).
EDIT #1
Here's my code below:
var casper = require('casper').create();
var fs = require('fs');
// This array maintains a list of links to each HOL profile
// Example of a valid URL: https://myurl.com/list/74832
var root = 'https://myurl.com/list/';
var end = 0;
var limit = 100000;
var scrapedRows = [];
// Returns the selector element property if the selector exists but otherwise returns defaultValue
function querySelectorGet(selector, property, defaultValue) {
var item = document.querySelector(selector);
item = item ? item[property] : defaultValue;
return item;
}
// Scraping function
function scrapDetails(querySelectorGet) {
var info1 = querySelectorGet("div.classA h1", 'innerHTML', 'N/A').trim()
var info2 = querySelectorGet("a.classB span", 'innerHTML', 'N/A').trim()
var info3 = querySelectorGet("a.classC span", 'innerHTML', 'N/A').trim()
//For scraping different texts of the same kind (i.e: comments from users)
var commentsTags = document.querySelectorAll('div.classComments');
var comments = Array.prototype.map.call(commentsTags, function(e) {
return e.innerText;
})
// Return all the rest of the information as a JSON string
return {
info1: info1,
info2: info2,
info3: info3,
// There is no fixed number of comments & answers so we join them with a semicolon
comments : comments.join(' ; ')
};
}
casper.start('http://myurl.com/login', function() {
this.sendKeys('#username', 'username', {keepFocus: true});
this.sendKeys('#password', 'password', {keepFocus: true});
this.sendKeys('#password', casper.page.event.key.Enter, {keepFocus: true});
// Logged In
this.wait(3000,function(){
//Verify connection by printing welcome page's title
this.echo( 'Opened main site titled: ' + this.getTitle());
});
});
casper.then( function() {
//Quick summary
this.echo('# of links : ' + limit);
this.echo('scraping links ...')
for (var i = 0; i < limit; i++) {
// Building the urls to visit
var link = root + end;
// Visiting pages...
casper.thenOpen(link).then(function() {
// We pass the querySelectorGet method to use it within the webpage context
var row = this.evaluate(scrapDetails, querySelectorGet);
scrapedRows.push(row);
// Stats display
this.echo('Scraped row ' + scrapedRows.length + ' of ' + limit);
});
end++;
}
});
casper.then(function() {
fs.write('infos.json', JSON.stringify(scrapedRows), 'w')
});
casper.run( function() {
casper.exit();
});
At this point I probably have more questions than answers but let's try.
Is there a particular reason why you're using CasperJS and not Curl for example ? I can understand the need for CasperJS if you are going to scrape a site that uses Javascript for example. Or you want to take screenshots. Otherwise I would probably use Curl along with a scripting language like PHP or Python and take advantage of the built-in DOM parsing functions.
And you can of course use dedicated scraping tools like Scrapy. There are quite a few tools available.
Then the 'obvious' question: do you really need to have arrays that large ? What you are trying to achieve is not clear, I am assuming you will want to store the extracted links to a database or something. Isn't it possible to split the process in small batches ?
One thing that should help is to allocate sufficient memory by declaring a fixed-size array ie:
var theArray = new Array(1000);
Resizing the array constantly is bound to cause performance issues. Every time new items are added to the array, expensive memory allocation operations must take place in the background, and are repeated as the loop is being run.
Since you are not showing any code, so we cannot suggest meaningful improvements, just generalities.
I'm using Slick with Play but am having some problems when trying to update a column value as it is not being updated, although I don't get any error back.
I have a column that tells me if the given row is selected or not. What I want to do is to get the current selected value (which is stored in a DB column) and then update that same column to have the opposite value. Currently (after many unsuccessful attempts) I have the following code which compiles and runs but nothing happens behind the scenes:
val action = listItems.filter(_.uid === uid).map(_.isSelected).result.map { selected =>
val isSelected = selected.head
println(s"selected before -> $isSelected")
val q = for {li <- listItems if li.uid === uid} yield li.isSelected
q.update(!isSelected)
}
db.run(action)
What am I doing wrong (I am new to slick so this may not make any sense at all!)
this needs to be seperate actions: one read followed by an update. Slick allows composition of actions in a neat way:
val targetRows = listItems.filter(_.uid === uid).map(_.isSelected)
val actions = for {
booleanOption <- targetRows.result.headOption
updateActionOption = booleanOption.map(b => targetRows.update(!b))
affected <- updateActionOption.getOrElse(DBIO.successful(0))
} yield affected
db.run(actions)
Update:
Just as a side note, RDBMSs usually facilitate constructs for performing updates in one database roundtrip such as updating a boolean column to it's opposite value without needing to read it first and manually negate it. This would look like this in mysql forexample:
UPDATE `table` SET `my_bool` = NOT my_bool
but to my knowledge the high level slick api doesn't support this construct. hence the need for two seperate database actions in your case. I myself would appreciate it if somebody proved me wrong.
I have created many, let's say three ObjectStores inside a defined and versioned IndexedDB schema.
I need to populate all of them. To do so, I created an object which stores both name end endpoint (where it gets data to populate).
Also to avoid error when trying to fill objectstores already populated, I use the count() method to ... count key inside the objectstore and if there are 0 key, then populates, else reads.
It works perfect if I execute on a one by one basis, that is instead of using a loop, declare and execute each one of the three objectstores.
However when invoking the function to populate each storage inside a loop, I get the following error message for the last two objectstores to be populated:
Failed to read the 'result' property from 'IDBRequest': The request
has not finished. at IDBRequest.counts.(anonymous function).onsuccess
Here is the code:
// object contains oject stores and endpoints.
const stores = [
{osName:'user-1', osEndPoint:'/api/1,
{osName:'user-2', osEndPoint:'/api/2},
{osName:'user-3', osEndPoint:'/api/3}
];
// open db.
var request = indexedDB.open(DB_NAME, DB_VERSION);
// in order to dynamically create vars, instantiate two arrays.
var tx = [];
var counts = [];
var total = [];
// onsuccess callback.
request.onsuccess = function (e) {
db = this.result;
for(k in stores) {
tx[k] = db.transaction(stores[k].osName).objectStore(stores[k].osName);
counts[k] = tx[i].count();
counts[k].onsuccess = function(e) {
total[k] = e.target.result;
// if the counting result equals 0, then populate by calling a function that does so.
if (total[k] == 0) {
fetchGet2(stores[k].osEndPoint, popTable, stores[k].osName); //
} else {
readData(DB_NAME, DB_VERSION, stores[0].osName);
}
};
} // closes for loop
}; // closes request.onsuccess.
The fetchGet2 function works well inside a loop, for example the loop used to create the objectstores, and also has been tested on a one by one basis.
It looks like an async issue, however I cannot figure how to fix the problem which is to be able to populate existing objectstores dynamically, avoiding to populate filled objectsores and only filling empty ones.
Indeed testing without the count issue, but inside the loop works perfect, or with the count but without loop.
At the moment and when logging counts[k], It only logs data for the last member of the object.
Thanks in advance, I'm coding with vanilla js, and I'm not interested in using any framework at all.
Yes, this looks like an issue with async. For loops iterate synchronously. Try writing a loop that does not advance i until each request completes.
I'm writing a Hadoop application calculates map data at a certain resolution. My Input files are tiles of a map, named according to the QuadTile principle. I need to subsample those, and stitch those together until I have a certain higher-level tile which covers a larger area but at a lower resolution. Like zooming out in google maps.
Currently my Mapper subsamples tiles and my reducer combines tiles a a certain level and forms tiles of one level up. So for so good. But depending on which tile I need, I need to repeat those map and reduce steps a x times, which I have not been able to do so far.
What would be the best way to do so? Is it possible without explicitly saving the tiles in some temp directory and starting a new mapreduce Job on those temp dirs until I get what I want? What I think would be the perfect solution is something roughly like 'while(context.hasMoreThanOneKey()){iterate mapreduce}'.
Following an answer, I have now written a class TileJob which extends Job. However, the mapreduce is still not chained. Could you tell me what I'm doing wrong?
public boolean waitForCompletion(boolean verbose) throws IOException, InterruptedException, ClassNotFoundException{
if(desiredkeylength != currentinputkeylength-1){
System.out.println("In loop, setting input at " + tempout);
String tempin = tempout;
FileInputFormat.setInputPaths(this, tempin);
tempout = (output + currentinputkeylength + "/");
FileOutputFormat.setOutputPath(this, new Path(tempout));
System.out.println("Setting output at " + tempout);
currentinputkeylength--;
Configuration conf = new Configuration();
TileJob job = new TileJob(conf);
job.setJobName(getJobName());
job.setUpJob(tempin, tempout, tiletogenerate, currentinputkeylength);
return job.waitForCompletion(verbose);
}else{
//desiredkeylength == currentkeylength-1
System.out.println("In else, setting input at " + tempout);
String tempin = tempout;
FileInputFormat.setInputPaths(this, tempin);
tempout = output;
FileOutputFormat.setOutputPath(this, new Path(tempout));
System.out.println("Setting output at " + tempout);
currentinputkeylength--;
Configuration conf = new Configuration();
TileJob job = new TileJob(conf);
job.setJobName(getJobName());
job.setUpJob(tempin, tempout, tiletogenerate, currentinputkeylength);
currentinputkeylength--;
return super.waitForCompletion(verbose);
}
}
Usually you kick a mapreduce step off by having a driver class main method that configures the Job, Configuration and format type (input and output). Once everything's ready to go that main method calls Job::waitForCompletion() which submits the job and waits for the job to complete before continuing.
You can wrap some of that logic in a loop that repeatedly calls Job::waitForCompletion() until your criteria is met. You can implement your criteria using counters. Put logic into your reduce() method to set or increment a counter with the number of keys. Your loop in the driver class can get the value of that (distributed) counter from the Job instance and you code your while expression using that value.
What file locations you use is up to you. Inside this driver loop you can change the file location for the inputs and outputs, or keep them the same.
I should probably add that you ought to go ahead and create a new Job and Configuration instance inside the loop. I don't know that those objects are reusable in this situation.
public static void main(String[] args) {
int keys = 2;
boolean completed = true;
while (completed & (keys > 1)) {
Job job = new Job();
// Do all your job configuration here
completed = job.waitForCompletion();
if (completed) {
keys = job.getCounter().findCounter("Total","Keys").getValue();
}
}
}
I have written a custom request handler in solr to meet my business requirements. The handler involves getting data from two different from SolrIndexSearchers. I want the returned doclists from the two SolrIndexSearchers merged into one.
I tried iterating through one and adding doc by doc to another, but all I could get was an "Unsupported Operation" exception. Is there anyway to merge two doclists?
[Edit 1] : Code snippet inside the overridden handleRequestBody method
SolrCore core = new SolrCore("Desired Directory 1", schema);
reader = IndexReader.open("Desired Directory 1");
searcher = new SolrIndexSearcher(core, schema, getName(), reader, false);
Sort lsort = null;
FilteredQuery filter = null;
DocList results1 = searcher.getDocList(query, filter, lsort, 0, 10);
reader.close();
searcher.close();
core.close();
SolrCore core = new SolrCore("Desired Directory 2", schema);
reader = IndexReader.open("Desired Directory 2");
searcher = new SolrIndexSearcher(core, schema, getName(), reader, false);
Sort lsort = null;
FilteredQuery filter = null;
DocList results2 = searcher.getDocList(query, filter, lsort, 0, 10);
reader.close();
searcher.close();
core.close();
rsp.add("response",results1);
rsp.add("response",results2);
Now that I have two DocLists results1 and results2, how do I merge them?
[Edit 2] : The problem is not an exception/stack trace. When I add two responses, I get the results in two response sets when it is a single machine search. When it is a distributed search, I only get the distribution between response 1 of machine 1 and response 1 of machine 2. IN my understanding, only when I merge the responses to a single set, I will be able to get proper distribution. Hope I am understandable?
DocList is not supposed to be changed after you retrieve it from a search result.
I had the similar problem and, if I remember correctly, I used:
org.apache.solr.util.SolrPluginUtils.docListToSolrDocumentList
to convert DocList to SolrDocumentList, which extends ArrayList<SolrDocument>, which consequently supports add(SolrDocument).
So the worst case scenario (from the performance standpoint) is to convert the first DocList to SolrDocumentList and then loop through all other DocLists calling add on each doc.
You'll have to test how efficient this approach is. I'm not Solr expert, but this is where I'd start testing.