Is it possible to use persisted face ids in Identify process - face-detection

In order to use the "Identify", I first need to detect faces in an image using "Face Detect", then I have to pass those to faceIds (at most 10 at a time) to be used in identification. Those faces expire after 24 hours.
FaceIDs stored in LargeFaceList (or FaceList) are not expired and can be used until they are deleted. So can I use FaceIDs stored in a LargeFaceList in "Identify" calls? Or do I have to keep using Face Detect to use "Identify"?

May I ask what's your scenario to use the faceId for "Identify" after 24 hours?
And second, to answer your question, the current API does not support using "persistedFaceId" as input to "Identify" call, if the time frame past 24 hours, you have to do "Detect" again to get the "faceId" to feed to "Identify". If you can provide your scenario, this would help the Service provider to understand the need.

Related

Gatling - How to set up Looping users?

According to the documentation apparently I can either insert a fixed number of users at a per-second rate for a given time using constantUsersPerSec(rate) during(duration) or insert a number of users using atOnceUsers(nbUsers).
What I would like is a fixed number of concurrent users looping. You using constantUsersPerSec(5), if the first batch didn't finish, in the next sec, five new users will be pushed. I would like to only push a new user when the any finishes.
In another words I want something like: atOnceUsers(nbUsers) and whenever any ends up, add another one and keep that up for 5 minutes
How can I achieve that?
What I would like is a fixed number of concurrent users
atOnceUsers
looping
https://gatling.io/docs/current/general/scenario#loop-statements
Injection profile only control users spawning. Users' lifespan is controlled by the scenario.
it sounds like you want
constantConcurrentUsers(5) during (5 minutes)

Aggregation Strategy based on message size of aggregate

I would like to aggregate exchanges, and when then exchange hits a certain size (say 20KB) I would like to mark the exchange as closed.
I have a rudimentary implementation that checks size of the exchange and if it is 18KB I return true from my predicate. However, if a messages comes in that is 4KB and it is currently 17KB that will mean I will complete the aggregation when it is 21KB which is too big.
Any ideas on how to solve this? Can I do something in the aggregation strategy to reject the join and start a new Exchange to aggregate on?
I figured I could put it through another process to check actual size remove messages off the end of the message to fit the size, and for each removed message, push them back through...but that seems a little ugly because I have a constantly compensating routine that would likely execute.
Thanks in advance for any tips.
I think there is an eager complete option you can use to mark it as complete when you have that 17 + 4 > 20 situation. Then it will complete the 17, and start a new group with the 4.
See the docs at: https://github.com/apache/camel/blob/master/camel-core/src/main/docs/eips/aggregate-eip.adoc
And you would also likely need to use `PreCompleteAggregationStrategy' and return true in that 17 + 4 > 20 situation, as otherwise it would group them together first and complete, eg so it becomes 21. But by using both the eager completion check option and this interface you can do as you want.
https://github.com/apache/camel/blob/master/camel-core/src/main/java/org/apache/camel/processor/aggregate/PreCompletionAwareAggregationStrategy.java

Understanding Datastore Get RPCs in Google App Engine

I'm using sharded counters (https://cloud.google.com/appengine/articles/sharding_counters) in my GAE application for performance reasons, but I'm having some trouble understanding why it's so slow and how I can speed things up.
Background
I have an API that grabs a set of 20 objects at a time and for each object, it gets a total from a counter to include in the response.
Metrics
With Appstats turned on and a clear cache, I notice that getting the totals for 20 counters makes 120 RPCs by datastore_v3.Get which takes 2500ms.
Thoughts
This seems like quite a lot of RPC calls and quite a bit of time for reading just 20 counters. I assumed this would be faster and maybe that's where I'm wrong. Is it supposed to be faster than this?
Further Inspection
I dug into the stats a bit more, looking at these two lines in the get_count method:
all_keys = GeneralCounterShardConfig.all_keys(name)
for counter in ndb.get_multi(all_keys):
If I comment out the get_multi line, I see that there are 20 RPC calls by datastore_v3.Get totaling 185ms.
As expected, this leaves get_multi to be the culprit for 100 RPC calls by datastore_v3. Get taking upwards of 2500 ms. I verified this, but this is where I'm confused. Why does calling get_multi with 20 keys cause 100 RPC calls?
Update #1
I checked out Traces in the GAE console and saw some additional information. They show a breakdown of the RPC calls there as well - but in the sights they say to "Batch the gets to reduce the number of remote procedure calls." Not sure how to do that outside of using get_multi. Thought that did the job. Any advice here?
Update #2
Here are some screen shots that show the stats I'm looking at. The first one is my base line - the function without any counter operations. The second one is after a call to get_count for just one counter. This shows a difference of 6 datastore_v3.Get RPCs.
Base Line
After Calling get_count On One Counter
Update #3
Based on Patrick's request, I'm adding a screenshot of info from the console Trace tool.
Try splitting up the for loop that goes through each item and the actual get_multi call itself. So something like:
all_values = ndb.get_multi(all_keys)
for counter in all_values:
# Insert amazeballs codes here
I have a feeling it's one of these:
The generator pattern (yield from for loop) is causing something funky with get_multi execution paths
Perhaps the number of items you are expecting doesn't match actual result counts, which could reveal a problem with GeneralCounterShardConfig.all_keys(name)
The number of shards is set too high. I've realized that anything over 10 shards causes performance issues.
When I've dug into similar issues, one thing I've learned is that get_multi can cause multiple RPCs to be sent from your application. It looks like the default in the SDK is set to 1000 keys per get, but the batch size I've observed in production apps is much smaller: something more like 10 (going from memory).
I suspect the reason it does this is that at some batch size, it actually is better to use multiple RPCs: there is more RPC overhead for your app, but there is more Datastore parallelism. In other words: this is still probably the best way to read a lot of datastore objects.
However, if you don't need to read the absolute most current value, you can try setting the db.EVENTUAL_CONSISTENCY option, but that seems to only be available in the older db library and not in ndb. (Although it also appears to be available via the Cloud Datastore API).
Details
If you look at the Python code in the App Engine SDK, specifically the file google/appengine/datastore/datastore_rpc.py, you will see the following lines:
max_count = (Configuration.max_get_keys(config, self.__config) or
self.MAX_GET_KEYS)
...
if is_read_current and txn is None:
max_egs_per_rpc = self.__get_max_entity_groups_per_rpc(config)
else:
max_egs_per_rpc = None
...
pbsgen = self._generate_pb_lists(indexed_keys_by_entity_group,
base_req.ByteSize(), max_count,
max_egs_per_rpc, config)
rpcs = []
for pbs, indexes in pbsgen:
rpcs.append(make_get_call(base_req, pbs,
self.__create_result_index_pairs(indexes)))
My understanding of this:
Set max_count from the configuration object, or 1000 as a default
If the request must read the current value, set max_gcs_per_rpc from the configuration, or 10 as a default
Split the input keys into individual RPCs, using both max_count and max_gcs_per_rpc as limits.
So, this is being done by the Python Datastore library.

How to make datastore keys mapreduce-friendly(-er)?

Edit: See my answer. Problem was in our code. MR works fine, it may have a status reporting problem, but at least the input readers work fine.
I ran an experiment several times now and I am now sure that mapreduce (or DatastoreInputReader) has odd behavior. I suspect this might have something to do with key ranges and splitting them, but that is just my guess.
Anyway, here's the setup we have:
we have an NDB model called "AdGroup", when creating new entities
of this model - we use the same id returned from AdWords (it's an
integer), but we use it as string: AdGroup(id=str(adgroupId))
we have 1,163,871 of these entities in our datastore (that's what
the "Datastore Admin" page tells us - I know it's not entirely
accurate number, but we don't create/delete adgroups very often, so
we can say for sure, that the number is 1.1 million or more).
mapreduce is started (from another pipeline) like this:
yield mapreduce_pipeline.MapreducePipeline(
job_name='AdGroup-process',
mapper_spec='process.adgroup_mapper',
reducer_spec='process.adgroup_reducer',
input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
mapper_params={
'entity_kind': 'model.AdGroup',
'shard_count': 120,
'processing_rate': 500,
'batch_size': 20,
},
)
So, I've tried to run this mapreduce several times today without changing anything in the code and without making changes to the datastore. Every time I ran it, mapper-calls counter had a different value ranging from 450,000 to 550,000.
Correct me if I'm wrong, but considering that I use the very basic DatastoreInputReader - mapper-calls should be equal to the number of entities. So it should be 1.1 million or more.
Note: the reason why I noticed this issue in the first place is because our marketing guys started complaining that "it's been 4 days after we added new adgroups and they still don't show up in your app!".
Right now, I can think of only one workaround - write all keys of all adgroups into a blobstore file (one per line) and then use BlobstoreLineInputReader. The writing to blob part would have to be written in a way that does not utilize DatastoreInputReader, of course. Should I go with this for now, or can you suggest something better?
Note: I have also tried using DatastoreKeyInputReader with the same code - the results were similar - mapper-calls were between 450,000 and 550,000.
So, finally questions. Is it important how you generate ids for your entities? Is it better to use int ids instead of str ids? In general, what can I do to make it easier for mapreduce to find all of my entities mapping them?
PS: I'm still in the process of experimenting with this, I might add more details later.
After further investigation we have found that the error was actually in our code. So, mapreduce actually works as expected (mapper is called for every single datastore entity).
Our code was calling some google services functions that were sometimes failing (the wonderful cryptic ApplicationError messages). Due to these failures, MR tasks were being retried. However, we have set a limit on taskqueue retries. MR did not detect nor report this in any way - MR was still showing "success" in the status page for all shards. That is why we thought that everything is fine with our code and that there is something wrong with the input reader.

EXCEEDED_ID_LIMIT: emptyRecycleBin id limit reached: 200

I'm just wondering if anyone else has seen this and if so, can you confirm that this is correct? The documentation claims, as you might expect, that 10,000 is the record limit for the system call:
Database.emptyRecycleBin(records);
not 200. Yet it's throwing an error at 200. The only thing I can think of is that this call occurs from within a batch Apex process.
It took a little over a week and me supplying a failing test case to salesforce support but the issue is now being reported as a salesforce known issue suggesting it may get addressed in the platform.
My workaround for now is to wrap the call in a Database.Batchable with the batch size to 200.
This is the only reference that I could find to there being a limit of 200 on emptyrecyclebin(), I dare say that you are correct
http://www.salesforce.com/us/developer/docs/api/Content/sforce_api_calls_emptyrecyclebin.htm
Adam, if you got shut down when attempting to log a case regarding this due to the whole Premier Support thing you should definitely escalate your case as it was handled incorrectly and SFDC needs to know about it. I had the same exact issue myself.
SOQL For Loops may be a helpful option for working around this limit as the 'for (Account[] accounts : [SELECT Id FROM Account WHERE IsDeleted = true ALL ROWS]' format provides batches of 200.

Resources