Solr Metrics Number confusion - solr

We want to start using the Solr metrics API. An example call:
http://localhost:8983/solr/admin/metrics?type=counter&group=core.bookings
We get results back like this from our master
We get metrics like this from our slave that does all the serving
The question is: what unit are the totalTime values measured in? If we presume they are milliseconds, that would mean each request was taking 0.8 of a day, which is just not right! Can someone explain what unit of measure totalTime uses?

After tracing the source, this is based on the value returned from the metrics timing object, which is given in nanoseconds:
long elapsed = timer.stop();
metrics.totalTime.inc(elapsed);
.. so it's nanoseconds, not milliseconds. This makes sense if we take the numbers and divide them:
>>> total = 718781574846646
>>> requests = 6789073
>>> total / requests
105873301.82583779
>>> time_per_request = total / requests
>>> time_per_request / 1_000_000_000
0.10587330182583779
.. so about 0.1s per request on average.

Related

Based on Gatling report, how to make sure 100 requests are processed in less than 1 second

How can I check, from my Gatling 3 report, that my requirement of 100 requests being processed in less than 1 second is met? I ran the simulation using Jenkins.
My simulation looks like this:
rampConcurrentUsers(1) to (100) during (161 second),
constantConcurrentUsers(100) during (1 minute)
Below is my response time percentile graph of two executions for an interval of one second.
[Image: response time percentile graphs for the two executions]
What do the min and max values here tell us? I am assuming the percentages 25%-99% represent the completion of the requests.
Those graph sections are not what you're after - they show the distribution of response times and the number of active users.
So min is the fastest response time
max is the longest
95% is the response time that 95% of your requests came in under
and so on...
So what you could do is look at the section of your graph corresponding to the 100 constant concurrent users injection stage. In this part you would require that the max response time always be under 1 second.
(Note: there's something odd with your 2nd report - I assume it didn't come from running the stated injection profile as it has more than 100 concurrent users active)
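To make the min / max / percentile reading concrete, here is a minimal Python sketch. The response times are made-up sample values (not taken from the report above), and `percentile` is a hypothetical helper using the nearest-rank method, which may differ slightly from how Gatling computes its percentiles.

```python
import math

def percentile(sorted_times_ms, pct):
    """Nearest-rank percentile: the smallest sample such that
    at least pct% of all samples are at or below it."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_times_ms)))
    return sorted_times_ms[rank - 1]

# Hypothetical response times in milliseconds (assumption, not real data)
times_ms = sorted([120, 180, 210, 250, 300, 340, 410, 520, 640, 980])

fastest = times_ms[0]            # "min" in the report
slowest = times_ms[-1]           # "max" in the report
p95 = percentile(times_ms, 95)   # 95% of requests were at or under this

# The requirement from the question: every request under 1 second
all_under_one_second = slowest < 1000
```

If the max stays under 1000 ms for the whole constant-users stage, every request in that stage met the 1 second requirement.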

get_by_key_name() in GAE taking as long as 750ms. Is this expected?

My program fetches ~100 entries in a loop. All entries are fetched using get_by_key_name(). Appstats show that some get_by_key_name() requests are taking as much as 750ms! (other big values are 355ms, 260ms, 230ms). Average for other fetches ranges from 30ms to 100ms. These times are in real_time and hence contribute towards 'ms' and not 'cpu_ms'.
Due to the above, total time taken to return the webpage is very high ms=5754, where cpu_ms=1472. (above times are seen repeatedly for back to back requests.)
Environment: Python 2.7, webapp2, jinja2, High Replication, No other concurrent requests to the server, Frontend Instance Class is F1, No memcache set yet, max idle instances is automatic, min pending latency is automatic, using db (NOT NDB).
Any help will be greatly appreciated as I based whole database design on fetching entries from the datastore using only get_by_key_name()!!
Update:
I tried profiling using time.clock() before and immediately after every get_by_key_name() method call. The difference I get from time.clock() for every single call is 10ms! (Just want to clarify that the get_by_key_name() is called on different Kinds).
According to time.clock() the total execution time (in wall-clock time) is 660ms. But the real-time is 5754 (=ms), and cpu_ms is 1472 per GAE logs.
Summary of Questions:
*[Update: This was addressed by passing a list of keys] Why is get_by_key_name() taking that long?*
Why is the ms value of 5754 so much higher than the cpu_ms value of 1472? Is task execution halted or waiting for 75% (1 - 1472/5754) of the time, which would explain why the real (wall-clock) time is so long as far as the end user is concerned?
If the above is true, then why does time.clock() show that only 660ms (wall-clock time) elapsed between the start of the first get_by_key_name() request and the last (~100th) one, while GAE reports 5754ms?
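One possible factor, offered as an assumption rather than a confirmed diagnosis: on Unix-like systems, Python 2's time.clock() reports processor time, not wall-clock time, so time spent waiting on a datastore call would not show up in it. The sketch below illustrates the difference with the Python 3 equivalents (`time.perf_counter()` for wall-clock time, `time.process_time()` for CPU time); the `time.sleep()` call is only a stand-in for a blocking RPC.

```python
import time

wall_start = time.perf_counter()   # wall-clock style timer
cpu_start = time.process_time()    # CPU time consumed by this process

# Stand-in for waiting on a remote call (assumption: the real code
# would block on get_by_key_name() here instead of sleeping)
time.sleep(0.2)

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print("wall: %.3fs, cpu: %.3fs" % (wall_elapsed, cpu_elapsed))
```

While the process is blocked, wall-clock time keeps advancing but CPU time barely moves, which is the same shape as ms being much larger than cpu_ms.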

Newrelic custom plugin metrics

I'm working directly with the HTTP API and trying to get some metrics from our storage.
The doc states "Tip: If you want the metric to appear as a percentage in the user interface, then you must define it as a percentage in the JSON."
However - I can't send metric values which are percentages; the POST response has status 400 with body
{"error":"Unable to parse request: null"}
My POST is
{
  "components": [
    {
      "duration": 1,
      "guid": "com.cumulus.Test5",
      "name": "ServerX",
      "metrics": {
        "Component/Filesystem/root/Percentage Used": "62%"
      }
    }
  ],
  "agent": {"host": "vss-syd", "version": "1.0.0", "pid": 1080}
}
Also - I have a metric "Number of devices offline" (for a ZFS storage pool) which is discrete i.e. not continuous - so averages don't make sense, just absolute values.
For which I'd like to set an alert if it gets above 0.
I know the threshold is only 'greater than', so I can set thresholds at 0.1 for Alert and 0.2 for Critical without a problem.
However - please can someone point me in the right direction as to how I should
Send such a metric (i.e. need to specify [units] and aggregates?)
Create the Summary Metric + Graphs in the frontend? (which 'Value' to select e.g. 'Calls per minute')
There are two issues that look like they could be the cause.
The first is that the duration should be 60, which represents the number of seconds for which the reported metrics correspond. NewRelic is optimized to work with this particular interval and while you can have larger values (300 seconds is the recommended maximum), the minimum required value is 60. Smaller values may be accepted by the API, but the results will be unpredictable.
The second is that the percentage used is a string value which should instead be reported as an integer value, such as 62, or a float value of 62.0 if you wish to preserve that level of precision.
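As a rough sketch of those two changes, the payload from the question could be built like this in Python. `build_payload` is a hypothetical helper name; the host, guid, and metric name are copied from the question, and only the duration and the metric value differ. Sending the request is left out, since the endpoint and headers are not shown in the question.

```python
import json

def build_payload(percent_used):
    """Build the request body with a 60 second duration and a numeric value."""
    return {
        "agent": {"host": "vss-syd", "version": "1.0.0", "pid": 1080},
        "components": [
            {
                "name": "ServerX",
                "guid": "com.cumulus.Test5",
                "duration": 60,  # seconds covered by this report
                "metrics": {
                    # numeric value instead of the string "62%"
                    "Component/Filesystem/root/Percentage Used": float(percent_used),
                },
            }
        ],
    }

body = json.dumps(build_payload(62))
```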
Regarding the second portion of your question about reporting and displaying a metric related to "# of Failing Disks":
New Relic does not currently support reporting metrics that represent absolute values. All metric values are presented in aggregate over some particular time period. Summary Metrics are aggregated over the most recent ~4 minutes, while metrics on charts and tables are aggregated over the time period selected in the time picker.
That said, you could try something along the lines of "percentage of failing disks" where perhaps an average might still be useful in that any non-zero value indicates a failure.
This average would be of questionable value once the aggregation time period became larger than a few minutes. However, given that summary metrics are always aggregated over a fixed time period of ~4 minutes — and it is summary metrics that trigger alerts — this may still be useful to you.

Count Method Throwing 500 error

I have run into a strange situation. I want to know the count of Fetch entities on a daily, weekly, monthly, and all-time basis. In the Datastore, the count is about 2,368,348. Whenever I try to get the count, either through the Model or through GqlQuery, I get a 500 error. When there are fewer rows, the code below works fine.
Can anyone correct me or tell me the right solution, please? I am using Python.
The Model:
class Fetch(db.Model):
    adid = db.IntegerProperty()
    ip = db.StringProperty()
    date = db.DateProperty(auto_now_add=True)
Stats code:
adid = cgi.escape(self.request.get('adid'))
...
query = "SELECT __key__ FROM Fetch WHERE adid = " + adid + " AND date >= :1"
rows = db.GqlQuery(query, monthlyDate)
fetch_count = 0
for row in rows:
    fetch_count = fetch_count + 1
self.response.out.write(fetch_count)
It looks like your query is taking longer than GAE allows a query to run (typically ~60 seconds). From the count() documentation:
Unless the result count is expected to be small, it is best to specify a limit argument; otherwise the method will continue until it finishes counting or times out.
From the Request Timer documentation:
A request handler has a limited amount of time to generate and return a response to a request, typically around 60 seconds. Once the deadline has been reached, the request handler is interrupted.
If a DeadlineExceededError is being raised, this is your problem. If you need to run this query consider using Backends in GAE. With Backends there is no time limit for generating and returning a request.

How do I measure response time in seconds given the following benchmarking data?

We recently got some data back on a benchmarking test from a software vendor, and I think I'm missing something obvious.
If there were 17 transactions (I assume they mean successfully completed requests) per second, and 1500 of these requests could be served in 5 minutes, then how do I get the response time for a single user? Is this sort of thing even possible with benchmarking? I have a lot of other data from them, including apache config settings, but I'm not sure how to do all the math.
Given the server setup they sent, I want to know how I can deduce the user response time. I have looked at other similar benchmarking tests, but I'm having trouble measuring requests to response time. What other data do I need to provide here to get that?
If only 1500 of these can be served per 5 minutes then:
1500 / 5 = 300 transactions per min can be served
300 / 60 = 5 transactions per second can be served
so how are they getting 17 completed transactions per second? Last time I checked 5 < 17 !
This doesn't seem to fit. Or am I looking at it wrongly?
I presume by user response time you mean the time it takes to serve a single transaction:
If they can serve 5 per second, then it takes 200ms (1/5) per transaction
If they can serve 17 per second, then it takes 59ms (1/17) per transaction
That is all we can tell from the given data. Perhaps clarify how many transactions are being done per second.
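The arithmetic above can be written out as a small Python sketch. `per_transaction_ms` is a hypothetical helper, and it assumes transactions are served one at a time; with concurrent requests, the time an individual user waits could be longer than this figure.

```python
def per_transaction_ms(transactions_per_second):
    """Milliseconds per transaction, assuming strictly serial processing."""
    return 1000.0 / transactions_per_second

served = 1500                 # requests served in the window
window_seconds = 5 * 60       # 5 minutes

throughput = served / window_seconds       # transactions per second
ms_at_5 = per_transaction_ms(throughput)   # using the 1500-in-5-minutes figure
ms_at_17 = per_transaction_ms(17)          # using the vendor's 17 per second

print("%.1f tps, %.0f ms vs %.1f ms" % (throughput, ms_at_5, ms_at_17))
```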

Resources