How to verify that every call of a load test generated a successful result at the end of a process chain? - artillery

I have an application that goes like this:
ingestion --> queue --> validation --> persistence --> database
I want to load test the ingestion and at the end verify that every submitted entry is stored in the database.
I have an Artillery script that posts to ingestion and recovers the same item from the database, but it does so as part of the same scenario and since the two components are implemented separately I'm actually measuring a combined performance, instead of that of each component.
I would like to load test the ingestion component while keeping hold of some search key that allows me to recover all sent items from the database. I've tried this by creating a JavaScript function that I call at the beginning of the ingestion scenario to generate a random search key and store it in Artillery's context, and then at the end of the scenario I call another function to recover all entries from the database.
The problem I found is that Artillery runs one copy of the scenario in each virtual client, so it calls the function each time it starts the scenario and recovers only one entry at the end. And the call to the database happens in the same scenario as the post to ingestion, so I'm again mixing performance.
What I would like to do, I suppose, would be to generate the search key in a scenario, run the posts in another scenario, and then retrieve the results in a third one. How can I do that?
Also, when I retrieve the results from the database, I would like to compare their count with the number of posts to ingestion. I couldn't find out whether expect works with variables returned in the context from function calls. Is this possible?

I don't believe this is possible. I have been reading the documentation and any examples I can find about Artillery scripts, and I don't see that there is any way to "chain" flows together.


Gatling: Reusing once extracted multiple values with .check(regex

I am trying to extract values from the session only once and use them in the following sessions.
//First transaction used in scenario
val goHomepage = http("OpenHomepage")
.check(css("ul.sublist a" , "href").findAll.saveAs("categories"))
In the last line I extract all the categories (e.g. Notebooks, Phones, etc.).
This is the very first transaction in my scenario. These categories are used in the following ones.
So if I have more than one virtual user, does this mean that this line performs the same action every time and saves the list of categories for every session, or does it overwrite itself?
If so, how can I get this list only once and keep it between requests without overwriting it? Or is it extracted just once, so there is no need to worry about resource consumption?
If the call to 'OpenHomepage' returns the same data for each user, then each user will carry the sublist in its session.
Why would you only want to execute this once? If you're trying to model 20 users each logging into the website, isn't it realistic to have them each hit the homepage?
Failing that, if the sublist of categories is reasonably constant, you could just hardcode it into your scenario or put the categories in a CSV file. Either way, if you want any kind of dynamic behaviour based on the contents of "categories", you'll need them in each user's session anyway.
If you really must execute "OpenHomepage" only once you can hack it together by doing something like what is described here

Datastore sometimes fails to fetch all required entities, but works the second time

I have a datastore entity called lineItems, which consists of individual line items to be invoiced. The users find the line items and attach a purchase order number to them. These are then displayed on the web page where they can create the invoice.
I would display my code for fetching the entities, but I don't think it matters, as this also happened a couple of times when I was using Managed VMs a few months ago and the code was completely different. (I was using Objectify before; now I am using the Datastore API.) In a nutshell, I am currently just using a StructuredQuery.setFilter(new PropertyFilter.eq("POnum",ponum)).setFilter(new PropertyFilter.eq("Invoiced", false)); (This is pseudocode; you can't chain two .setFilter calls like this. The real code accepts a list of PropertyFilters and creates a composite filter properly.)
What happened this morning was the admin person created the invoice, and all but two of the lines were on the invoice. There were two lines which the code never fetched, and those lines were stuck in the "invoices to create" section.
The admin person simply created the invoice again for the given purchase order number, but the second time it DID pick up the two remaining lines and created a second invoice.
Note that the entities were created/edited almost 24 hours before (when she assigned the purchase order number to them), so they had been sitting in the database for quite a while (I checked my logs). This is not a case where they were just created and then accessed within a short period of time. It is also NOT a case of failing to update the entities: the code creates the invoice in a third-party accounting package, and the two lines simply were not there. Upon success of the invoice creation, all of the entities are then updated with "invoiced = true" and written to the datastore. So the lines which were not on the invoice in the accounting program are the ones that weren't updated in the datastore. (This is not a "smart" check either; it does not check line by line. It simply checks whether the invoice creation was successful or not, and then updates all of the entities that it has in memory.)
As far as I can tell, the datastore simply did not return all of the entities which matched the query the first time but it did the second time.
There are approximately 40,000 lineItem entities.
What are the conditions which can cause a datastore fetch to randomly fail to grab all of the entities which meet the search parameters of a StructuredQuery? (Note that this also happened twice while using Objectify on the now deprecated Managed VM architecture.) How can I stop this from happening, or check to see if it has happened?
You may be seeing eventual consistency because you are not using an ancestor query.

Concurrency with Objectify in GAE

I created a test web application to test persist-read-delete of entities. It runs a simple loop that persists an Entity, retrieves and modifies it, and then deletes it, 100 times.
In some iterations of the loop there is no problem, but in others there is an error saying that the Entity already exists and thus can't be persisted (a custom exception I added).
In other iterations the Entity can't be modified because it does not exist, and in others it can't be deleted because it does not exist.
I understand that the loop may be so fast that the previous operation on the App Engine datastore is not yet complete. This causes errors such as the Entity not existing when I try to access it, or an Entity with the same ID not being creatable because the delete operation has not finished yet, and so forth.
However, I want to understand how to handle this kind of situation, where concurrent operations are being performed on an Entity.
From what I understand you are doing something like the following:
for i in range(0, 100):
    ent = My_Entity()        # create and save the entity
    ent.put()
    ent = db.get(ent.key())  # get, modify and save the entity
    ent.prop = 'foo'
    ent.put()
    ent = db.get(ent.key())  # get and delete the entity
    ent.delete()
with some error checking to make sure you have entities to delete, modify, and you are running into a bunch of errors about finding the entity to delete or modify. As you say, this is because the calls aren't guaranteed to be executed in order.
However, I want to understand how to handle this kind of situation, where concurrent operations are being performed on an Entity.
Your best bet for this is to batch any modifications you are making to an entity before persisting it. For example, if you are going to be creating/saving/modifying/saving or modifying/saving/deleting, wherever possible try to combine these steps (i.e. create/modify/save or modify/delete). Not only will this avoid the errors you're seeing, but it will also cut down on your RPCs. Following this strategy, the above loop would be reduced to...
prop = None
for i in range(0, 100):
    prop = 'foo'
Put in other words, for anything that requires setting/deleting that quickly, just use a local variable. That's GAE's answer for you. After you figure out all the quick stuff you can then persist that information in an entity.
Other than that there isn't much you can do. Transactions can help you if you need to make sure a bunch of entities are updated together, but they won't help if you're trying to do multiple things to one entity at once.
EDIT: You could also look at the pipelines API.

How to efficiently check if a result set changed and serve it to a web application for syndication

Here is the scenario:
I am handling a SQL Server database with a stored procedure which takes care of returning headers for Web feed items (RSS/Atom) I am serving as feeds through a web application.
This stored procedure should, when called by the service broker task running at a given interval, verify if there has been a significant change in the underlying data - in that case, it will trigger a resource intensive activity of formatting the feed item header through a call to the web application which will get/retrieve the data, format them and return to the SQL database.
There the header would be stored ready for a request for RSS feed update from the client.
Now, while trying to design this to be as efficient as possible, I still have a couple of decision points I'd like to get your suggestions about.
My tentative approach at the stored procedure would be:
- gather the data in an in-memory table,
- create a subquery with the signature columns which change with the information,
- convert them to XML with FOR XML AUTO,
- hash the result with MD5 (with HASHBYTES or fn_repl_hash_binary depending on the size of the result),
- verify whether the hash matches the one stored in the table where I am storing the HTML waiting for the feed requests,
- if the hash matches, do nothing; otherwise proceed with the updates.
The first doubt is the best way to check if the base data have changed.
Converting to XML significantly inflates the data (which slows hashing), and I am potentially not using the result for anything apart from hashing: is there any better way to perform the check or to pack all the data together for hashing (something csv-like)?
The query is merging and aggregating data from multiple tables, so I would not rely on table timestamps, as their change is not necessarily related to a change in the result set.
The second point is: what is the best way to serve the data to the webapp for reformatting?
- I might push the data through a CLR function to the web application to get the data formatted (but this is synchronous and for multiple feed items it would create an unsustainable delay).
- I might instead save the result set and trigger multiple asynchronous calls through the service broker. The web app might then retrieve the stored data in some way instead of re-running the expensive query which produced them.
Since I have different formats depending on the feed item category, I cannot use the same table format - so storing to a table is going to be hard.
I might serialize to XML instead.
But is this going to provide any significant gain compared to re-running the query?
For the efficient caching bit, have a look at query notifications. The tricky bit in implementing this in your case is you've stated "significant change" whereas query notifications will trigger on any change. But the basic idea is that your application subscribes to a query. When the results of that query change, a message is sent to the application and it does whatever it is programmed to do (typically refreshing cached data).
As for serving the data to your app, there's a saying in the business: "don't go borrowing trouble". Which is to say if the default method of serving data (i.e. a result set w/o fancy formatting) isn't causing you a problem, don't change it. Change it only if and when it's causing you a significant enough headache that your time is best spent there.
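On the change check itself, the csv-like packing mentioned in the question could look roughly like the sketch below. It is written in Python rather than T-SQL purely for illustration, it uses SHA-256 where the question mentions MD5, and the function names and sample rows are made up for the example. It assumes the query returns its rows in a stable order (e.g. via ORDER BY), since the digest depends on row order.

```python
import csv
import hashlib
import io


def result_set_digest(rows):
    """Return a SHA-256 hex digest of an iterable of row tuples."""
    buffer = io.StringIO()
    # csv.writer quotes fields containing delimiters or newlines, so two
    # different rows cannot collapse into the same serialized text.
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()


def has_changed(rows, stored_digest):
    """True when the current rows no longer match the stored digest."""
    return result_set_digest(rows) != stored_digest


# Hypothetical signature columns for two feed items.
rows = [(1, "First post", "2011-03-01"), (2, "Second post", "2011-03-02")]
stored = result_set_digest(rows)
```

Only the signature columns need to go into the digest, so the serialized text stays much smaller than a FOR XML rendering of the same rows.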

Google App Engine: efficient large deletes (about 90000/day)

I have an application that has only one Model with two StringProperties.
The initial number of entities is around 100 million (I will upload those with the bulk loader).
Every 24 hours I must remove about 70000 entities and add 100000 entities. My question is now: what is the best way of deleting those entities?
Is there any way to avoid fetching the entity before deleting it? I was unable to find a way of doing something like:
DELETE from xxx WHERE foo1 IN ('bar1', 'bar2', 'bar3', ...)
I realize that App Engine offers an IN clause (albeit with a maximum length of 30, because of the maximum number of individual requests per GQL query; see note 1), but to me that still seems strange because I will have to get the x entities and then delete them again (making two RPC calls per entity).
Note: the entity should be ignored if not found.
EDIT: Added info about problem
These entities are simply domains, the first string being the SLD and the second the TLD (no subdomains). The application can be used to perform a request like this: http://[...]/available/ . The application will return a True/False JSON object.
Why do I have so many entities? Because the datastore contains all registered domains (.com for now). I cannot perform a whois request in every case because of TOSs and latency. So I initially populate the datastore with an entire zone file and then daily add/remove the domains that have been registered/dropped... The problem is that these are pretty big quantities and I have to figure out a way to keep costs down and add/remove 2*~100000 domains per day.
Note: there is hardly any computation going on as an availability request simply checks whether the domain exists in the datastore!
1: 'A maximum of 30 datastore queries are allowed for any single GQL query.'
If you are not doing so already, you should be using key_names for this.
You'll want a model something like:
class UnavailableDomain(db.Model):
    pass  # the domain name is stored as the key_name, so no properties are declared here
Then you will populate your datastore like:
Then you will query for available domains with something like:
is_available = UnavailableDomain.get_by_key_name('') is None
Then when you need to remove a bunch of domains because they have become available, you can build a big list of keys without having to query the database first like:
free_domains = ['', '']
db.delete(db.Key.from_path('UnavailableDomain', name) for name in free_domains)
I would still recommend batching up the deletes into something like 200 per RPC if your free_domains list is really big.
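A minimal sketch of that batching, written as plain Python so it runs without the App Engine SDK. The names chunked and delete_in_batches are made up for this example, and delete_fn is a stand-in: in the real application it would be something that builds keys with db.Key.from_path('UnavailableDomain', name) and passes them to db.delete.

```python
def chunked(items, size=200):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_in_batches(key_names, delete_fn, size=200):
    """Call delete_fn once per batch; return the number of calls made."""
    calls = 0
    for batch in chunked(key_names, size):
        delete_fn(batch)  # one RPC per batch in the real application
        calls += 1
    return calls


# Stand-in for the datastore call: it only records what it was given.
deleted = []
calls = delete_in_batches(
    ["domain-%d" % i for i in range(450)],
    deleted.extend,
)
```

With 450 names and a batch size of 200, the stand-in is called three times (200, 200 and 50 names).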
Have you considered the appengine-mapreduce library? It comes with the pipeline library, and you could use both to:
- create a pipeline for the overall task that you will run via cron every 24 hours
- have the 'overall' pipeline start a mapper that filters your entities and yields the delete operations
- after the delete mapper completes, have the 'overall' pipeline call an 'import' pipeline to start running your entity creation part
- have the pipeline API send you an email to report on its status
