Gatling: group actions in foreach into parallel chunks

I have a gatling scenario where I retrieve 1000 documents from a database via a RESTful API.
I then modify the documents and send update requests for each.
This is how I'm currently doing it:
...
val scrollQueries = scenario("Enrichment Topologies")
  .exec(ScrollQueryInitiator.query, repeat(numberOfPagesToScrollThrough, "scrollQueryCounter") {
    exec(ScrollQuery.query,
         pause(10 seconds).foreach("${hitsJson}", "hit") { exec(HitProcessor.query) })
  })
...
Here are the main features of interest:
ScrollQuery.query fetches the 1000 results and saves them into hitsJson in the session.
It then pauses for 10 seconds to simulate longer-term processing.
The 1000 results are iterated over, and for each item a HitProcessor is run which sends the update request.
As it stands, the foreach loop means that each request is sent one after the other.
Question
What I really want is to work through the 1000 results in groups of 10, sending update requests in parallel 10 at a time.
How can I achieve this?

Try moving the fetch part to a before hook.
Now that you have the data, you can start 10 concurrent users:
setUp( scn.inject(atOnceUsers(10)))

Related

GCP PubSub - How to enqueue asynchronous message?

I would like some information about configuring the publisher in the GCP Pub/Sub environment. I would like to enqueue messages that will be consumed by a Google Cloud Function. To achieve this, publishing should trigger when a certain number of messages is reached or after a certain amount of time.
I set up the topic as follows:
topic.PublishSettings = pubsub.PublishSettings{
    ByteThreshold:  1e6,              // Publish a batch when its size in bytes reaches this value (1e6 = 1 MB).
    CountThreshold: 100,              // Publish a batch when it has this many messages.
    DelayThreshold: 10 * time.Second, // Publish a non-empty batch after this delay has passed.
}
When I call the publish function, I have a 10 second delay on each call. Messages are not added to the queue ...
for _, v := range list {
    ctx := context.Background()
    res := a.Topic.Publish(ctx, &pubsub.Message{Data: v})
    // Block until the result is returned and a server-generated
    // ID is returned for the published message.
    serverID, err = res.Get(ctx)
    if err != nil {
        return "", err
    }
}
Can someone help me?
Cheers
Batching on the publisher side is designed to allow for more cost efficiency when sending messages to Google Cloud Pub/Sub. Given that the minimum billing unit for the service is 1KB, it can be cheaper to send multiple messages in the same Publish request. For example, sending two 0.5KB messages as separate Publish requests would result in being charged for sending 2KB of data (1KB for each). If one were to batch that into a single Publish request, it would be charged as 1KB of data.
The tradeoff with batching is latency: in order to fill up batches, the publisher has to wait to receive more messages to batch together. The three batching properties (ByteThreshold, CountThreshold, and DelayThreshold) allow one to control the level of that tradeoff. The first two properties control how much data or how many messages we put in a single batch. The last property controls how long the publisher should wait to send a batch.
As an example, imagine you have CountThreshold set to 100. If you are publishing few messages, it could take a while to receive 100 messages to send as a batch. This means that the latency for messages in that batch will be higher because they are sitting in the client waiting to be sent. With DelayThreshold set to 10 seconds, a batch would be sent if it had 100 messages in it or if the first message in the batch was received at least 10 seconds ago. Therefore, this is putting a limit on the amount of latency to introduce in order to have more data in an individual batch.
The code as you have it is going to result in batches with only a single message that each take 10 seconds to publish. The reason is the call to res.Get(ctx), which will block until the message has been successfully sent to the server. With CountThreshold set to 100 and DelayThreshold set to 10 seconds, the sequence that is happening inside your loop is:
1. A call to Publish puts a message in a batch to publish.
2. That batch is waiting to receive 99 more messages or for 10 seconds to pass before being sent to the server.
3. The code is waiting for this message to be sent to the server and return with a serverID.
4. Given the code doesn't call Publish again until res.Get(ctx) returns, it waits 10 seconds to send the batch.
5. res.Get(ctx) returns with a serverID for the single message.
6. Go back to 1.
If you actually want to batch messages together, you can't call res.Get(ctx) before the next Publish call. You'll want to either call Publish inside a goroutine (so one goroutine per message), or amass the result objects in a slice and then call Get on them outside the loop, e.g.:
var res []*pubsub.PublishResult
ctx := context.Background()
for _, v := range list {
    res = append(res, a.Topic.Publish(ctx, &pubsub.Message{Data: v}))
}
for _, r := range res {
    serverID, err = r.Get(ctx)
    if err != nil {
        return "", err
    }
}
Something to keep in mind is that batching will optimize cost on the publish side, not on the subscribe side. Cloud Functions is built with push subscriptions. This means that messages must be delivered to the subscriber one at a time (since the response code is what is used to ack or nack each message), which means there is no batching of messages delivered to the subscriber.

Constant timer delays the thread more than is set

I have a test plan with 10 requests. Just the requests, without a Constant Timer, take about 18 seconds. When I add one Constant Timer with a 1000 millisecond delay after the third request, it takes about 28 seconds.
Is it a problem with JMeter, or am I doing something wrong?
I'm running on Ubuntu (elementary OS) with JMeter v2.11 r1554548.
I'm testing another server, not my own laptop.
In the JMeter test plan I'm using a Cache Manager, Cookie Manager, and Request Defaults at the beginning, one request with a POST action, and a Summary Report, Graph Results, View Results in Table, and Simple Data Writer at the end of the test plan.
Everything is in one thread.
The position of a timer element has no impact; it does not execute where it is located.
In fact, it applies to every child request of the timer's parent.
Read this:
http://jmeter.apache.org/usermanual/test_plan.html
4.10 Scoping Rules

Which NDB query function is more efficient to iterate through a big set of query results?

I use NDB for my app and use iter() with a limit and a starting cursor to iterate through 20,000 query results in a task. A lot of the time I run into the timeout error.
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
The way I make the call is like this:
results = query.iter(limit=20000, start_cursor=cursor, produce_cursors=True)
for item in results:
    process(item)
save_cursor_for_next_time(results.cursor_after().urlsafe())
I can reduce the limit, but I thought a task can run for as long as 10 minutes, and 10 minutes should be more than enough time to go through 20,000 results. In fact, on a good run, the task can complete in just about a minute.
If I switched to fetch() or fetch_page(), would they be more efficient and less likely to run into the timeout error? I suspect there's a lot of overhead in iter() that causes the timeout error.
Thanks.
Fetch is not really any more efficient; they all use the same underlying mechanism. The exception is if you know how many entities you want up front, in which case fetch can be more efficient because you end up with just one round trip.
You can increase the batch size for iter; that can improve things. See https://developers.google.com/appengine/docs/python/ndb/queryclass#kwdargs_options
From the docs the default batch size is 20, which would mean for 20,000 entities a lot of batches.
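For example, a minimal sketch based on the call from the question (batch_size is a standard NDB query option; 500 is just an illustrative value):
# Same call as in the question, but with a larger batch size, so the
# 20,000 entities come back in roughly 40 datastore round trips instead of ~1,000.
results = query.iter(limit=20000, start_cursor=cursor,
                     produce_cursors=True, batch_size=500)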
Other things that can help: consider using map and/or map_async for the processing, rather than explicitly calling process(entity). Have a read of https://developers.google.com/appengine/docs/python/ndb/queries#map; introducing async into your processing can also mean improved concurrency.
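As a rough sketch of the map() approach, assuming the same query object and process() function as in the question:
def process_item(item):
    # the per-entity work that process(item) does in the question
    process(item)

# map() applies the callback to each entity as the batches stream in;
# map_async() returns a Future so other work can overlap the datastore I/O.
query.map(process_item, limit=20000, batch_size=500)
# or, asynchronously:
future = query.map_async(process_item, limit=20000, batch_size=500)
future.get_result()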
Having said all of that, you should profile so you can understand where the time is spent. For instance, the delays could be in your processing because of things you are doing there.
There are other things to consider with NDB, like context caching, which you need to disable. I also used the iter method for these, and I made an NDB version of the mapper API from the old db module.
Here is my NDB mapper API, which should solve the timeout problems and NDB caching, and makes it easy to create this kind of thing:
http://blog.altlimit.com/2013/05/simple-mapper-class-for-ndb-on-app.html
With this mapper API you can create something like the following, or just improve on it.
class NameYourJob(Mapper):
    def init(self):
        self.KIND = YourItemModel
        self.FILTERS = [YourItemModel.send_email == True]

    def map(self, item):
        # here is your process(item)
        # process here
        item.send_email = False
        self.update(item)

# Then run it like this
from google.appengine.ext import deferred
deferred.defer(NameYourJob().run, 50,  # <-- this is your batch
               _target='backend_name_if_you_want', _name='a_name_to_avoid_dups')
For potentially long query iterations, we use a time check to ensure slow processing can be handled. Given the disparities in GAE infrastructure performance, you will likely never find an optimal processing number. The code excerpt below is from an online maintenance handler we use which generally runs within ten seconds. If not, we get a return code saying it needs to be run again, thanks to our timer check. In your case, you would likely break out of the loop after passing the cursor to your next queue task (see the sketch after the code below).
Here is some sample code, edited down to hopefully give you a good idea of our logic. One other note: you may choose to break this up into smaller bites and then fan out the smaller tasks by re-enqueueing the task until it completes. Doing 20k things at once seems very aggressive in GAE's highly variable environment. HTH -stevep
def over_dt_limit(start, milliseconds):
    dt = datetime.datetime.now() - start
    mt = float(dt.seconds * 1000) + (float(dt.microseconds) / float(1000))
    if mt > float(milliseconds):
        return True
    return False

# set a start time
start = datetime.datetime.now()

# handle a timeout issue inside your query iteration
for item in query.iter():
    # do your loop logic
    if over_dt_limit(start, 9000):
        # your specific time-out logic here
        break
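In your case, the hand-off to the next queue task might look something like this (a sketch only; the URL and parameter name are made up):
from google.appengine.api import taskqueue

it = query.iter(produce_cursors=True)
for item in it:
    # do your loop logic
    if over_dt_limit(start, 9000):
        # pass the cursor to a follow-up task so it resumes where this run stopped
        taskqueue.add(url='/tasks/continue-processing',
                      params={'cursor': it.cursor_after().urlsafe()})
        break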

Bulk Download via Google App Engine Backend

I have 1.6 Million entities in a Google App Engine app that I would like to download. I tried using the built in bulkloader mechanism but found that it is terribly slow. While I can only download ~30 entities/second via the bulkloader, I can do ~500 entities/second by querying the datastore via a backend. A backend is necessary to circumvent the 60 second request limit. In addition, datastore queries can only live for up to 30 seconds so you need to break up your fetches across multiple queries using query cursors.
The code on the server side fetches 1000 entities and returns a query cursor:
cursor = request.get('cursor')
devices = Pushdev.all()
if cursor and cursor != '':
    devices.with_cursor(cursor)
next1000 = devices.fetch(1000)
for d in next1000:
    t = int(time.mktime(d.created.timetuple()))
    response.out.write('%s/%s/%d\n' % (d.name, d.alias, t))
response.out.write(devices.cursor())
On the client side, I have a loop that invokes the handler on the server with a null cursor to begin with and then starts to pass the cursor received by the previous invocation. It terminates when it gets an empty result.
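That loop looks roughly like the following (a sketch only; the endpoint URL and the print are illustrative, and the line format matches the handler above):
import urllib2

cursor = ''
while True:
    body = urllib2.urlopen('https://myapp.appspot.com/dump?cursor=' + cursor).read()
    lines = body.split('\n')
    records, cursor = lines[:-1], lines[-1]  # entity lines, then the next cursor
    if not records:
        break  # nothing but a cursor came back, so we are done
    for record in records:
        print(record)  # name/alias/timestamp line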
PROBLEM: I am only able to fetch a fraction - ~20% of the entities using this method. I get a response with empty data even though the full set of entities has not been traversed. Why does this method not fetch everything comprehensively?
I couldn't find anything to confirm or deny this in the docs, but my guess is that all() has a non-deterministic ordering such that eventually one of your fetch(1000)'s will hit the "last element" and devices.cursor() will return nothing.
Try this:
devices = Pushdev.all().order('__key__')

How long can the map call last?

I want to do some heavy processing in the map() call of the mapper.
I was going through the source file MapReduceServlet.java:
// Amount of time to spend on actual map() calls per task execution.
public static final int PROCESSING_TIME_PER_TASK_MS = 10000;
Does it mean the map call can last only 10 seconds? What happens after 10 seconds?
Can I increase this to a larger number, like 1 minute or 10 minutes?
-Aswath
MapReduce operations are executed in tasks using push queues, and as stated in the documentation, the task deadline is currently 10 minutes (a limit after which you will get a DeadlineExceededException).
If a task fails to execute, App Engine retries it by default until it succeeds. If you need a longer deadline than 10 minutes, you can use a Backend to execute your tasks.
Looking at the actual usage of PROCESSING_TIME_PER_TASK_MS in Worker.java, this value is used to limit the number of map calls done in a single task.
After each map call has been executed, if more than 10 seconds have elapsed since the beginning of the task, a new task is spawned to handle the rest of the map calls:
1. Worker.scheduleWorker spawns a new task for each given shard
2. Each task will call Worker.processMapper
3. processMapper executes 1 map call
4. If less than PROCESSING_TIME_PER_TASK_MS has elapsed since 2, go back to 3
5. Else, if processing is not finished, reschedule a new worker task
In the worst case scenario, the default task request deadline (10 minutes) should apply to each of your individual map calls.
