Based on the Gatling report, how to make sure 100 requests are processed in less than 1 second

How can I check my requirement that 100 requests are processed in less than 1 second in my Gatling 3 report? I ran this using Jenkins.
My simulation looks like this:
rampConcurrentUsers(1) to (100) during (161 second),
constantConcurrentUsers(100) during (1 minute)
Below is my response time percentiles graph for the two executions, at an interval of one second.
[response time percentiles graph from the two runs]
What do the min and max here tell us? I am assuming the percentages 25%-99% are the completion times of the requests.

Those graph sections are not what you're after - they show the distribution of response times and the number of active users.
So min is the fastest response time,
max is the longest,
95% is the response time that 95% of your requests came in under,
and so on...
So what you could do is look at the section of your graph corresponding to the constant 100 concurrent users injection stage. In this part you would require that the max response time always be under 1 second.
(Note: there's something odd with your 2nd report - I assume it didn't come from running the stated injection profile, as it shows more than 100 concurrent users active.)
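If you want Jenkins to enforce the requirement automatically rather than by reading the chart, Gatling 3's assertions DSL is the built-in way: adding something like global.responseTime.max.lt(1000) to your setUp(...).assertions(...) will mark the run as failed when any request takes 1 second or more, which the Jenkins/Maven plugins can turn into a build failure. Alternatively, here is a rough Python sketch that checks the generated report. It assumes the Gatling 3.x report layout where js/global_stats.json contains a maxResponseTime entry in milliseconds, so verify the path and field names against your own report; note it checks the whole run, not just the constant-100-users stage.

#!/usr/bin/env python
# Rough sketch: fail a Jenkins step when the Gatling report's overall max
# response time is 1 second or more. The report layout (js/global_stats.json
# with a "maxResponseTime" -> "ok" value in milliseconds) is an assumption
# based on Gatling 3.x reports; adjust to match yours.
import json
import sys

report_dir = sys.argv[1]  # the simulation's report directory

with open("%s/js/global_stats.json" % report_dir) as f:
    stats = json.load(f)

max_ok_ms = stats["maxResponseTime"]["ok"]
print("max response time (ok requests): %s ms" % max_ok_ms)

if max_ok_ms >= 1000:
    sys.exit("FAILED: max response time %s ms is not under 1 second" % max_ok_ms)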

Related

get_by_key_name() in GAE taking as long as 750ms. Is this expected?

My program fetches ~100 entries in a loop. All entries are fetched using get_by_key_name(). Appstats shows that some get_by_key_name() requests are taking as much as 750ms! (Other big values are 355ms, 260ms, and 230ms.) The average for other fetches ranges from 30ms to 100ms. These times are real time and hence contribute towards 'ms', not 'cpu_ms'.
Because of the above, the total time taken to return the web page is very high: ms=5754, whereas cpu_ms=1472. (The above times are seen repeatedly for back-to-back requests.)
Environment: Python 2.7, webapp2, jinja2, High Replication, no other concurrent requests to the server, Frontend Instance Class is F1, no memcache set yet, max idle instances is automatic, min pending latency is automatic, using db (NOT NDB).
Any help will be greatly appreciated, as I based the whole database design on fetching entries from the datastore using only get_by_key_name()!
Update:
I tried profiling using time.clock() before and immediately after every get_by_key_name() call. The difference I get from time.clock() for every single call is 10ms! (Just to clarify: get_by_key_name() is called on different Kinds.)
According to time.clock(), the total execution time (in wall-clock time) is 660ms. But the real time is 5754 (=ms), and cpu_ms is 1472 per the GAE logs.
Summary of Questions:
[Update: this was addressed by passing a list of keys; see the sketch after these questions] Why is get_by_key_name() taking that long?
Why is ms (5754) so much more than cpu_ms (1472)? Is task execution halted/waiting for 75% (1 - 1472/5754) of the time, which is why the real (wall-clock) time is so long as far as the end user is concerned?
If the above is true, then why does time.clock() show that only 660ms (wall-clock time) elapsed between the start of the first get_by_key_name() request and the last (~100th) one, although GAE shows this time as 5754ms?
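For reference, a minimal sketch of the batched fetch mentioned in the update above. It assumes an old-style db.Model kind; the model name Entry and the key names are made up for illustration. get_by_key_name() accepts a list of key names and fetches them in a single batched datastore call, returning entities in the same order (None for missing ones).

# Sketch of the batched fix noted in the update; Entry and the key names
# are placeholder examples.
from google.appengine.ext import db

class Entry(db.Model):
    value = db.StringProperty()

key_names = ["entry-%d" % i for i in range(100)]

# ~100 separate round trips:
entries_one_by_one = [Entry.get_by_key_name(name) for name in key_names]

# One batched call returning a list in the same order:
entries_batched = Entry.get_by_key_name(key_names)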

Newrelic custom plugin metrics

I'm working directly with the HTTP API and trying to get some metrics from our storage.
The doc states "Tip: If you want the metric to appear as a percentage in the user interface, then you must define it as a percentage in the JSON."
However, I can't send metric values that are percentages; the POST response has status 400 with the body
{"error":"Unable to parse request: null"}
My POST is
{"components": [
{"duration": 1,
"guid": "com.cumulus.Test5",
"name":"ServerX",
"metrics": {
"Component/Filesystem/root/Percentage Used": "62%"
}
}],
"agent": {"host": "vss-syd", "version": "1.0.0", "pid": 1080}
}
Also, I have a metric "Number of devices offline" (for a ZFS storage pool) which is discrete, i.e. not continuous, so averages don't make sense, only absolute values. For this I'd like to set an alert if it gets above 0.
I know the threshold is only 'greater than', so I can set thresholds at 0.1 (Alert) and 0.2 (Critical), no problem.
However, can someone please point me in the right direction as to how I should:
Send such a metric (i.e. do I need to specify [units] and aggregates?)
Create the summary metric and graphs in the frontend? (Which 'Value' should I select, e.g. 'Calls per minute'?)
There are two issues that look like they could be the cause.
The first is that the duration should be 60, which is the number of seconds to which the reported metrics correspond. New Relic is optimized to work with this particular interval; while you can use larger values (300 seconds is the recommended maximum), the minimum required value is 60. Smaller values may be accepted by the API, but the results will be unpredictable.
The second is that the percentage is being sent as a string ("62%") when it should be reported as a number: an integer such as 62, or a float such as 62.0 if you wish to preserve that level of precision.
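Putting both fixes together, a corrected request might look like the sketch below (Python with the requests library). The endpoint URL and the X-License-Key header are assumptions based on the Platform API docs; use whatever endpoint and credentials you are already POSTing with.

# Sketch of the corrected payload: duration of 60 seconds and the
# percentage sent as a number instead of the string "62%".
# LICENSE_KEY and ENDPOINT are placeholders.
import requests

LICENSE_KEY = "YOUR_LICENSE_KEY"
ENDPOINT = "https://platform-api.newrelic.com/platform/v1/metrics"  # assumed Platform API endpoint

payload = {
    "components": [{
        "duration": 60,
        "guid": "com.cumulus.Test5",
        "name": "ServerX",
        "metrics": {
            "Component/Filesystem/root/Percentage Used": 62.0
        }
    }],
    "agent": {"host": "vss-syd", "version": "1.0.0", "pid": 1080}
}

response = requests.post(
    ENDPOINT,
    json=payload,
    headers={"X-License-Key": LICENSE_KEY, "Accept": "application/json"},
)
print(response.status_code, response.text)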
Regarding the second portion of your question about reporting and displaying a metric related to "# of Failing Disks":
New Relic does not currently support reporting metrics that represent absolute values. All metric values are presented in aggregate over some particular time period. Summary Metrics are aggregated over the most recent ~4 minutes, while metrics on charts and tables are aggregated over the time period selected in the time picker.
That said, you could try something along the lines of "percentage of failing disks" where perhaps an average might still be useful in that any non-zero value indicates a failure.
This average would be of questionable value once the aggregation time period became larger than a few minutes. However, given that summary metrics are always aggregated over a fixed time period of ~4 minutes — and it is summary metrics that trigger alerts — this may still be useful to you.

What are some suggested LogParser queries to run to detect sources of high network traffic?

Looking at the network in/out metrics for our AWS/EC2 instance, I would like to find the sources of the high network traffic spikes.
I have installed Log Parser Studio and run a few queries, primarily looking for responses that took a while:
SELECT TOP 10000 * FROM '[LOGFILEPATH]' WHERE time-taken > 1000
I am also targeting time spans that cover when the network in/out spikes have occurred:
SELECT TOP 20000 * FROM '[LOGFILEPATH]'
WHERE [date] BETWEEN TIMESTAMP('2013-10-20 02:44:00', 'yyyy-MM-dd hh:mm:ss')
AND TIMESTAMP('2013-10-20 02:46:00', 'yyyy-MM-dd hh:mm:ss')
One issue is that the log files are 2-7 GB (I'm targeting single files per query). When I tried Log Parser Lizard, it crashed with an out-of-memory exception on large files (boo).
What are some other queries and methodologies I should follow to identify the source of the high network traffic, which would hopefully help me figure out how to plug the hole?
Thanks.
One function that may be of particular use to you is the QUANTIZE() function. This allows you to aggregate stats for a period of time thus allowing you to see spikes in a given time period. Here is one query I use that allows me to see when we get scanned:
SELECT QUANTIZE(TO_LOCALTIME(TO_TIMESTAMP(date, time)), 900) AS LocalTime,
COUNT(*) AS Hits,
SUM(sc-bytes) AS TotalBytesSent,
DIV(MUL(1.0, SUM(time-taken)), Hits) AS LoadTime,
SQRROOT(SUB(DIV(MUL(1.0, SUM(SQR(time-taken))), Hits), SQR(LoadTime))) AS StandardDeviation
INTO '[OUTFILEPATH]'
FROM '[LOGFILEPATH]'
WHERE '[WHERECLAUSE]'
GROUP BY LocalTime
ORDER BY LocalTime
I usually output this to a .csv file and then chart it in Excel to visually see where a period of time is out of the normal range. This particular query breaks things down into 15-minute segments, based on the 900 passed to QUANTIZE. The TotalBytesSent, LoadTime and StandardDeviation columns let me see other aberrations in downloaded content or response times.
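If Excel becomes unwieldy with the larger files, a short pandas/matplotlib script can chart the same CSV. This is only a sketch: the file name is a placeholder for whatever you used as [OUTFILEPATH], and it assumes the CSV has a header row with the aliases used in the query above (LocalTime, Hits, TotalBytesSent, LoadTime, StandardDeviation).

# Sketch: chart the CSV produced by the QUANTIZE query above.
# "quantized.csv" stands in for your [OUTFILEPATH].
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("quantized.csv", parse_dates=["LocalTime"])

fig, axes = plt.subplots(3, 1, sharex=True, figsize=(12, 8))
df.plot(x="LocalTime", y="Hits", ax=axes[0], legend=False, title="Hits per 15 min")
df.plot(x="LocalTime", y="TotalBytesSent", ax=axes[1], legend=False, title="Bytes sent per 15 min")
df.plot(x="LocalTime", y="LoadTime", ax=axes[2], legend=False, title="Average time-taken (ms)")
plt.tight_layout()
plt.show()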
Another thing to look at is the number of requests a particular client has made to your site. The following query can help identify scanning or DoS activity coming in:
SELECT
DISTINCT c-ip as ClientIP,
COUNT(*) as Hits,
PROPCOUNT(*) as Percentage
INTO '[OUTFILEPATH]'
FROM '[LOGFILEPATH]'
WHERE '[WHERECLAUSE]'
GROUP BY ClientIP
HAVING (Hits > 50)
ORDER BY Percentage DESC
Adjusting the HAVING clause sets the minimum number of requests an IP needs to make before it shows up; based on the activity and the WHERE clause, 50 may be too low. The PROPCOUNT() function gives a value as a percentage of the overall total of a particular field - in this case, what percentage of all requests made to the site came from a particular IP. Typically this will also surface the IP addresses of search engines, but those are pretty easy to weed out.
I hope that gives you some ideas on what you can do.

GAE Task Queues: how do I add a delay?

In my task queue, code is executed that connects to the server side through URL Fetch.
My queue.yaml file:
queue:
- name: default
  rate: 10/m
  bucket_size: 1
With these settings, the tasks are all performed at once, simultaneously.
The requirement is that there must be a delay of at least 5 seconds between requests: the tasks should run one after another, more than 5 seconds apart, not in parallel.
What values should I set in queue.yaml?
You can't currently specify minimum delays between tasks in queue.yaml; you have to do it (at least partly) in your own code. For example, if you specify a bucket size of 1 (so that more than one task should never be executing at once) and make sure each task runs for at least 5 seconds (record start = time.time() at the beginning and call time.sleep(max(0, start + 5 - time.time())) at the end), this should work. If it doesn't, have each task record in the datastore the timestamp at which it finished, and when a task starts, check whether the last task ended less than 5 seconds ago; if so, terminate immediately.
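As a concrete illustration of that padding approach, here is a minimal sketch of a webapp2 push-task handler; the handler class, the URL and do_work() are made up for the example.

# Sketch of the "pad each task to at least 5 seconds" idea above.
import time
import webapp2

def do_work():
    # whatever the task actually does (placeholder)
    pass

class PaddedTaskHandler(webapp2.RequestHandler):
    def post(self):
        start = time.time()
        do_work()
        # Sleep away whatever is left of the 5-second window, so that with
        # bucket_size: 1 the next task cannot start sooner than 5s after this one.
        remaining = start + 5 - time.time()
        if remaining > 0:
            time.sleep(remaining)

app = webapp2.WSGIApplication([("/tasks/padded", PaddedTaskHandler)])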
Another way would be to store the task data in a table. Add an id parameter to your task. Fetch the first task from the table and pass its id to the task-queue processing servlet. At the end of the servlet, delay for 5 seconds, fetch the next task, pass its id, and so on.
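A sketch of that chaining approach, again with made-up model and handler names: each task processes one row, waits 5 seconds, then enqueues the next one.

# Sketch of the chained-task approach above; PendingTask, the handler and
# the URL are placeholders.
import time
import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import db

class PendingTask(db.Model):
    payload = db.TextProperty()
    done = db.BooleanProperty(default=False)

class ChainedTaskHandler(webapp2.RequestHandler):
    def post(self):
        task = PendingTask.get_by_id(int(self.request.get("id")))
        # ... process task.payload here ...
        task.done = True
        task.put()

        time.sleep(5)  # keep at least 5 seconds between requests

        next_task = PendingTask.all().filter("done =", False).get()
        if next_task:
            taskqueue.add(url="/tasks/chained",
                          params={"id": next_task.key().id()})

app = webapp2.WSGIApplication([("/tasks/chained", ChainedTaskHandler)])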

How do I measure response time in seconds given the following benchmarking data?

We recently got some data back from a benchmarking test run by a software vendor, and I think I'm missing something obvious.
If there were 17 transactions (I assume they mean successfully completed requests) per second, and 1500 of these requests could be served in 5 minutes, then how do I get the response time for a single user? Is this sort of thing even possible with benchmarking? I have a lot of other data from them, including Apache config settings, but I'm not sure how to do the math.
Given the server setup they sent, I want to know how I can deduce the user response time. I have looked at other, similar benchmarking tests, but I'm having trouble getting from request rates to response times. What other data do I need to provide here to get that?
If only 1500 of these can be served per 5 minutes then:
1500 / 5 = 300 transactions per min can be served
300 / 60 = 5 transactions per second can be served
so how are they getting 17 completed transactions per second? Last time I checked, 5 < 17!
This doesn't seem to fit. Or am I looking at it wrong?
I presume that by user response time you mean the time it takes to serve a single transaction:
If they can serve 5 per second, then it takes 200ms (1/5 s) per transaction.
If they can serve 17 per second, then it takes 59ms (1/17 s) per transaction.
That is all we can tell from the given data. Perhaps ask them to clarify how many transactions are actually being completed per second.
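A quick check of the arithmetic above, in plain Python:

# Reproducing the numbers in the answer.
served_per_5_min = 1500
throughput = served_per_5_min / (5 * 60.0)  # transactions per second

# Per-transaction time as the reciprocal of throughput
# (only meaningful if transactions are handled one at a time).
print(throughput)   # 5.0, not 17
print(1 / 5.0)      # 0.2 s  -> 200 ms
print(1 / 17.0)     # ~0.059 s -> ~59 ms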
