According to the documentation, I can either inject users at a fixed per-second rate for a given time using constantUsersPerSec(rate) during(duration), or inject a number of users all at once using atOnceUsers(nbUsers).
What I would like is a fixed number of concurrent users looping. With constantUsersPerSec(5), if the first batch hasn't finished, five new users are still pushed in the next second. I would like to push a new user only when one of the running users finishes.
In other words, I want something like atOnceUsers(nbUsers), and whenever any user finishes, add another one, keeping that up for 5 minutes.
How can I achieve that?
For "a fixed number of concurrent users": atOnceUsers.
For "looping": use the scenario loop statements: https://gatling.io/docs/current/general/scenario#loop-statements
Injection profiles only control how users are spawned; a user's lifespan is controlled by the scenario.
It sounds like you want:
constantConcurrentUsers(5) during (5 minutes)
In order to use "Identify", I first need to detect faces in an image using "Face Detect", then pass those faceIds (at most 10 at a time) to be used in identification. Those faceIds expire after 24 hours.
FaceIds stored in a LargeFaceList (or FaceList) do not expire and can be used until they are deleted. So can I use faceIds stored in a LargeFaceList in "Identify" calls, or do I have to keep calling "Face Detect" in order to use "Identify"?
May I ask what your scenario is for using the faceId with "Identify" after 24 hours?
Second, to answer your question: the current API does not support using a "persistedFaceId" as input to the "Identify" call. Once the 24-hour window has passed, you have to run "Detect" again to get a fresh "faceId" to feed into "Identify". If you can describe your scenario, it will help the service provider understand the need.
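For reference, a rough sketch of the detect-then-identify flow against the v1.0 REST endpoints; the endpoint, key, image URL and group id below are placeholders, and Identify runs against a trained (Large)PersonGroup, so double-check the parameter names against the current Face API reference:

import requests

# Placeholders: fill in your own Face resource endpoint, key and group id.
ENDPOINT = 'https://YOUR_RESOURCE.cognitiveservices.azure.com'
HEADERS = {'Ocp-Apim-Subscription-Key': 'YOUR_FACE_API_KEY',
           'Content-Type': 'application/json'}

# Step 1: Detect. The returned faceIds are the transient ids that expire
# after 24 hours; they are the only ids Identify accepts as input.
detect = requests.post('%s/face/v1.0/detect' % ENDPOINT, headers=HEADERS,
                       json={'url': 'https://example.com/photo.jpg'})
face_ids = [f['faceId'] for f in detect.json()][:10]  # at most 10 per call

# Step 2: Identify against a trained LargePersonGroup. persistedFaceIds from
# a FaceList/LargeFaceList cannot be passed here.
identify = requests.post('%s/face/v1.0/identify' % ENDPOINT, headers=HEADERS,
                         json={'faceIds': face_ids,
                               'largePersonGroupId': 'YOUR_GROUP_ID'})
print(identify.json())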
I'm using sharded counters (https://cloud.google.com/appengine/articles/sharding_counters) in my GAE application for performance reasons, but I'm having some trouble understanding why it's so slow and how I can speed things up.
Background
I have an API that grabs a set of 20 objects at a time and for each object, it gets a total from a counter to include in the response.
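Roughly, the handler does something like the sketch below (the model and counter names are made up; get_count() is the sharded-counter read from the linked article):

from google.appengine.ext import ndb

class Item(ndb.Model):  # stand-in for the real model; the name is made up
    name = ndb.StringProperty()

def build_response():
    items = Item.query().fetch(20)
    return [{'id': item.key.id(),
             # get_count() comes from the sharded-counter article linked above
             'total': get_count('Item-%s' % item.key.id())}
            for item in items]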
Metrics
With Appstats turned on and a clear cache, I notice that getting the totals for 20 counters makes 120 RPCs by datastore_v3.Get which takes 2500ms.
Thoughts
This seems like quite a lot of RPC calls and quite a bit of time for reading just 20 counters. I assumed this would be faster and maybe that's where I'm wrong. Is it supposed to be faster than this?
Further Inspection
I dug into the stats a bit more, looking at these two lines in the get_count method:
all_keys = GeneralCounterShardConfig.all_keys(name)
for counter in ndb.get_multi(all_keys):
If I comment out the get_multi line, I see that there are 20 RPC calls by datastore_v3.Get totaling 185ms.
As expected, this leaves get_multi as the culprit for the 100 RPC calls by datastore_v3.Get taking upwards of 2500 ms. I verified this, but this is where I'm confused: why does calling get_multi with 20 keys cause 100 RPC calls?
Update #1
I checked out Traces in the GAE console and saw some additional information. They show a breakdown of the RPC calls there as well, but in the insights they say to "Batch the gets to reduce the number of remote procedure calls." I'm not sure how to do that outside of using get_multi; I thought that did the job. Any advice here?
Update #2
Here are some screenshots that show the stats I'm looking at. The first one is my baseline: the function without any counter operations. The second one is after a call to get_count for just one counter. This shows a difference of 6 datastore_v3.Get RPCs.
Base Line
After Calling get_count On One Counter
Update #3
Based on Patrick's request, I'm adding a screenshot of info from the console Trace tool.
Try splitting up the for loop that goes through each item and the actual get_multi call itself. So something like:
all_values = ndb.get_multi(all_keys)
for counter in all_values:
    # Insert amazeballs codes here
I have a feeling it's one of these:
The generator pattern (yielding from inside the for loop) is causing something funky with get_multi's execution paths
Perhaps the number of items you are expecting doesn't match actual result counts, which could reveal a problem with GeneralCounterShardConfig.all_keys(name)
The number of shards is set too high. I've realized that anything over 10 shards causes performance issues.
When I've dug into similar issues, one thing I've learned is that get_multi can cause multiple RPCs to be sent from your application. It looks like the default in the SDK is set to 1000 keys per get, but the batch size I've observed in production apps is much smaller: something more like 10 (going from memory).
I suspect the reason it does this is that at some batch size, it actually is better to use multiple RPCs: there is more RPC overhead for your app, but there is more Datastore parallelism. In other words: this is still probably the best way to read a lot of datastore objects.
However, if you don't need to read the absolute most current value, you can try setting the db.EVENTUAL_CONSISTENCY option, but that seems to only be available in the older db library and not in ndb. (Although it also appears to be available via the Cloud Datastore API).
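For what it's worth, with the old db library the eventually consistent batch read would look roughly like this (a sketch only; shard_keys stands for the list of db.Key objects for one counter's shards, and the question's code uses ndb, so this isn't a drop-in change):

from google.appengine.ext import db

# Sketch: eventually consistent batch get with the older db library.
counters = db.get(shard_keys, read_policy=db.EVENTUAL_CONSISTENCY)
total = sum(c.count for c in counters if c is not None)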
Details
If you look at the Python code in the App Engine SDK, specifically the file google/appengine/datastore/datastore_rpc.py, you will see the following lines:
max_count = (Configuration.max_get_keys(config, self.__config) or
             self.MAX_GET_KEYS)
...
if is_read_current and txn is None:
  max_egs_per_rpc = self.__get_max_entity_groups_per_rpc(config)
else:
  max_egs_per_rpc = None
...
pbsgen = self._generate_pb_lists(indexed_keys_by_entity_group,
                                 base_req.ByteSize(), max_count,
                                 max_egs_per_rpc, config)
rpcs = []
for pbs, indexes in pbsgen:
  rpcs.append(make_get_call(base_req, pbs,
                            self.__create_result_index_pairs(indexes)))
My understanding of this:
Set max_count from the configuration object, or 1000 as a default
If the request must read the current value, set max_egs_per_rpc from the configuration, or 10 as a default
Split the input keys into individual RPCs, using both max_count and max_egs_per_rpc as limits.
So, this is being done by the Python Datastore library.
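As a back-of-the-envelope illustration of how that splitting adds up (this is not App Engine code, and the shard counts are assumed examples, not values taken from the question): each shard in the sharded-counter recipe is its own entity group, so a single get_multi over one counter's shards gets split roughly like this:

import math

MAX_EGS_PER_RPC = 10  # the observed "read current" batch size discussed above

def rpcs_per_get_multi(num_keys, max_egs_per_rpc=MAX_EGS_PER_RPC):
    # Each key lives in its own entity group, so the keys are split into
    # ceil(num_keys / max_egs_per_rpc) separate datastore_v3.Get RPCs.
    return int(math.ceil(num_keys / float(max_egs_per_rpc)))

for shards_per_counter in (10, 20, 50):
    per_counter = rpcs_per_get_multi(shards_per_counter)
    print('%2d shards/counter -> %d RPCs per get_count, %3d for 20 counters'
          % (shards_per_counter, per_counter, per_counter * 20))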
Edit: See my answer. The problem was in our code. MR works fine; it may have a status-reporting problem, but at least the input readers work fine.
I have run an experiment several times now and I am sure that mapreduce (or DatastoreInputReader) behaves oddly. I suspect this might have something to do with key ranges and how they are split, but that is just my guess.
Anyway, here's the setup we have:
we have an NDB model called "AdGroup"; when creating new entities of this model we use the same id returned from AdWords (it's an integer), but we use it as a string: AdGroup(id=str(adgroupId))
we have 1,163,871 of these entities in our datastore (that's what the "Datastore Admin" page tells us; I know it's not an entirely accurate number, but we don't create/delete adgroups very often, so we can say for sure that the number is 1.1 million or more).
mapreduce is started (from another pipeline) like this:
yield mapreduce_pipeline.MapreducePipeline(
    job_name='AdGroup-process',
    mapper_spec='process.adgroup_mapper',
    reducer_spec='process.adgroup_reducer',
    input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
    mapper_params={
        'entity_kind': 'model.AdGroup',
        'shard_count': 120,
        'processing_rate': 500,
        'batch_size': 20,
    },
)
So, I've tried to run this mapreduce several times today without changing anything in the code and without making changes to the datastore. Every time I ran it, mapper-calls counter had a different value ranging from 450,000 to 550,000.
Correct me if I'm wrong, but considering that I use the very basic DatastoreInputReader - mapper-calls should be equal to the number of entities. So it should be 1.1 million or more.
Note: the reason why I noticed this issue in the first place is because our marketing guys started complaining that "it's been 4 days after we added new adgroups and they still don't show up in your app!".
Right now, I can think of only one workaround: write all the keys of all adgroups into a blobstore file (one key per line) and then use BlobstoreLineInputReader. The part that writes the keys to the blob would, of course, have to be done without DatastoreInputReader. Should I go with this for now, or can you suggest something better?
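Roughly what I have in mind is the untested sketch below, using the old Files API (AdGroup is the ndb model described above); for 1.1 million keys this would have to run outside a normal request deadline, probably in chunks driven by query cursors:

from google.appengine.api import files

def dump_adgroup_keys():
    file_name = files.blobstore.create(mime_type='text/plain')
    with files.open(file_name, 'a') as f:
        for key in AdGroup.query().iter(keys_only=True):
            f.write('%s\n' % key.urlsafe())  # one urlsafe key per line
    files.finalize(file_name)
    # The resulting blob key would then go into BlobstoreLineInputReader's
    # mapper params.
    return files.blobstore.get_blob_key(file_name)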
Note: I have also tried using DatastoreKeyInputReader with the same code - the results were similar - mapper-calls were between 450,000 and 550,000.
So, finally, my questions. Does it matter how you generate ids for your entities? Is it better to use int ids instead of string ids? In general, what can I do to make it easier for mapreduce to find and map all of my entities?
PS: I'm still in the process of experimenting with this, I might add more details later.
After further investigation, we found that the error was actually in our code. Mapreduce works as expected: the mapper is called for every single datastore entity.
Our code was calling some Google services functions that were sometimes failing (with wonderfully cryptic ApplicationError messages). Because of these failures, the MR tasks were being retried; however, we had set a limit on taskqueue retries. MR did not detect or report this in any way; it kept showing "success" on the status page for all shards. That is why we thought everything was fine with our code and that there was something wrong with the input reader.
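For anyone hitting the same failure mode, here is a sketch of one way to surface such failures in the mapper (the function and service names are placeholders, not our actual code):

import logging

def adgroup_mapper(adgroup):
    try:
        stats = fetch_adgroup_stats(adgroup)  # the flaky Google-services call
    except Exception:
        # Log loudly and re-raise, so a shard that exhausts its taskqueue
        # retries doesn't silently drop entities while MR reports "success".
        logging.exception('mapper failed for AdGroup %s', adgroup.key.id())
        raise
    yield (str(adgroup.key.id()), stats)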
As I said in the title, my team developed a social website based on Zend Framework (1.11).
The problem is that our client wants debug output (an on-screen list of DB queries) showing the execution time, the number of rows affected, and the statement itself.
Zend_Db_Profiler gives us the time and the statement with no hassle, but we also need the number of rows each query affected (fetched, updated, inserted or deleted).
How can I accomplish this?
This is the current implementation, which prevents you from getting what you want:
$qp->start($this->_queryId);
$retval = $this->_execute($params);
$prof->queryEnd($this->_queryId);
A possible solution would be:
Create your own Statement class, say one extending Zend_Db_Statement_Mysqli (or whichever adapter you use).
Override its execute() method so that $retval is passed on to $prof->queryEnd($this->_queryId).
Set $_defaultStmtClass in Zend_Db_Adapter_Mysqli to your new statement class name.
Create your own Zend_Db_Profiler with queryEnd($queryId) redefined so that it accepts $retval and handles it.
In the video/PDF from "Data pipelines with Google App Engine" Brett puts "now / 30" into the task name noting that he will explain the reason later, but somehow he never does. :)
http://www.youtube.com/watch?v=zSDC_TU7rtc#t=41m35
task_name = '%s-%d-%d' % (sum_name, int(now / 30), index)
Do you have any idea about the reason? Does it have anything to do with the 7 day period in which one can't re-use task names?
Link to the session page
Brett Slatkin's own explanation
[Brett]
Hey all,
The int(time.time()/30) part of the task name is to prevent queue stalls. When memcache gets evicted the work index counter will be reset to zero. That means new fork-join work items may insert tasks that are named the same as tasks that were already inserted. By including a time window of ~30 seconds in the task name, we ensure that this problem can only last for about thirty seconds. This is also why you should raise an exception when you see a TombstonedTaskError exception.
Worst-case scenario if the clocks are wonky is that two tasks are run to do the fan-in work instead of just one, which is an acceptable trade-off in many cases and a fundamental possibility when using the task queue API. This can be mitigated using pigeon-hole acknowledgment entities, like I use in my materialized view example.
Hope that helps,
[/Brett]
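For reference, a minimal sketch of the naming pattern Brett describes, with the TombstonedTaskError handling he recommends (the queue URL, sum_name and index values are illustrative, not from the talk):

import time
from google.appengine.api import taskqueue

def enqueue_fan_in_task(sum_name, index):
    # The int(now / 30) bucket bounds how long a memcache-reset work index
    # can collide with task names that were already inserted.
    task_name = '%s-%d-%d' % (sum_name, int(time.time() / 30), index)
    try:
        taskqueue.add(name=task_name, url='/work/fan_in',
                      params={'sum_name': sum_name, 'index': index},
                      countdown=10)
    except taskqueue.TaskAlreadyExistsError:
        pass  # another request already enqueued this batch's task; that's fine
    except taskqueue.TombstonedTaskError:
        raise  # per the advice above: surface this instead of swallowing it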