95th and 99th percentile latency - database

I am measuring database performance and am looking at p95 and p99 latency.
My results are as follows.
Database A shows:
95thPercentileLatency(ms) 20
99thPercentileLatency(ms) 28
Database B shows:
95thPercentileLatency(ms) 1
99thPercentileLatency(ms) 3
I understand that the 99th percentile latency means that 99% of operations completed in under the given latency, so for database B that is 3 ms.
I am unsure of the significance of one system having 95th and 99th percentile latencies that are very close to each other, and another where they are much further apart. What does this mean?
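For reference, here is a minimal sketch of how p95 and p99 can be read off a set of raw latency samples. It assumes the nearest-rank definition of a percentile (benchmark tools may interpolate instead), and both the percentile helper and the sample values are made up for illustration; they are not the measurements above. It shows that the two figures are simply two points on the sorted latency distribution, so the gap between them describes how quickly latency grows between the slowest 5% and the slowest 1% of operations.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in ms (assumption, not real measurements).
# 100 samples chosen so that p95 and p99 come out as 1 and 3.
latencies_ms = [1] * 95 + [2, 2, 2, 3, 40]

p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p95, p99, max(latencies_ms))  # prints: 1 3 40
```

Note that neither percentile says anything about the slowest 1% of operations: the 40 ms outlier in this sample does not affect p95 or p99 at all.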

Related

A performance issue resulting from "limit 0" in TDengine database

limit 0 is suspected of triggering a full table query bug.
After switching from 2.6.0.32 to 3.0.2.1 today, CPU usage on all three nodes (each with a 32-core CPU) exceeded 90%, whereas in the original 2.6.0.32 environment CPU usage had never exceeded 10%. The figure below shows one of the nodes.
Running show queries revealed two statements of the form select * from t XXX limit 0;
The query time differs by a factor of more than 30,000 between the two environments. The 3.0.2.1 time is as follows:
The 2.6.0.32 time is as follows:
Next, after changing the statement to select * from t XXX limit 1 in the 3.0.2.1 environment, the time spent dropped from 74 seconds to 0.03 seconds and returned to normal, as shown in the figure below.
Finally, here is the comparison chart of the two environments (after the CPU usage of 3.0 dropped, query speed improved).
In addition, the configuration and table structure of the two environments are the same. In detail, the 18 ms query time in 2.6.0.32 is still lower than the 30 ms in 3.0.2.1.

How do I inject the reliability of machines into my ANN

I would like to predict the reliability of my physical machines with an ANN.
Q1) What is the right metric for measuring the reliability of a repairable machine?
Q2) To calculate the reliability of each machine in each time period (row), should I calculate TBF or MTBF and feed that to my ANN?
Q3) Is an ANN a good machine learning approach for this problem?
Let's take a look.
One of the inputs to my predictor ANN is the current reliability value of each physical machine, obtained by applying the right distribution function with the right metric (MTBF or MTTF). The sample data contains two machines with some log events.
Each event has a time, a machine ID, and an event_type: event_type=0 when a machine became available to the cluster, event_type=1 when a machine failed, and event_type=2 when a machine available to the cluster had its available resources changed.
For a non-repairable product, MTTF is the preferred reliability measure; MTBF is for a repairable product.
What is the right metric to get the current reliability value for each time period (row): TBF or MTBF? Previously I used MTTF = total uptime / total number of failures. To calculate the uptime, I subtract the time of the closest preceding event_type=0 from the time of each event_type=1, sum these differences, and then divide the total uptime by the number of failures. Or do I need the TBF for each row? The machine events table looks like:
time machine_id event_type R()
0 6640223 0
30382.66466 6640223 1
30399.2805 6640223 0
37315.23415 6640223 1
37321.64514 6640223 0
0 3585557842 0
37067.13354 3585557842 1
37081.0917 3585557842 0
37081.2932 3585557842 2
37321.33633 3585557842 2
37645.77424 3585557842 1
37824.73506 3585557842 0
37824.73506 3585557842 2
41666.42118 3585557842 2
After preprocessing the machine events table above to get input_2 (Reliability) for the training data table, the expected table should look like:
start_time machine_id input_x1 input_2_(Reliability) Predicted_output_Reliability
0 111 0.06 xx.xx
1 111 0.04 xx.xx
2 111 0.06 xx.xx
3 111 0.55 xx.xx
0 222 0.06 xx.xx
1 222 0.06 xx.xx
2 222 0.86 xx.xx
3 222 0.06 xx.xx
mean time TO failure
It is (or should be) a predictor of equipment reliability. The TO in that term indicates its predictive intent.
Mean time to failure (MTTF) is the length of time a device or other
product is expected to last in operation. MTTF is one of many ways to
evaluate the reliability of pieces of hardware or other technology.
https://www.techopedia.com/definition/8281/mean-time-to-failure-mttf
e.g.
take the total hours of operation of identical equipment items
divided by the number of failures of those items
If there are 100 items, all except one operate for 100 hours.
One fails at 50 hours.
MTTF = ((99 items x 100 hrs) + (1 item x 50 hrs)) / 1 failure = 9950 hours
----
I believe you have been calculating MTBF
mean time BETWEEN failures
This measure is based on recorded events.
Mean time between failure (MTBF) refers to the average amount of time
that a device or product functions before failing. This unit of
measurement includes only operational time between failures and does
not include repair times, assuming the item is repaired and begins
functioning again. MTBF figures are often used to project how likely a
single unit is to fail within a certain period of time.
https://www.techopedia.com/definition/2718/mean-time-between-failures-mtbf
the MTBF of a component is the sum of the lengths of the operational
periods divided by the number of observed failures
https://en.wikipedia.org/wiki/Mean_time_between_failures
In short, the data you have in that table is suited to an MTBF calculation, in the manner you have been doing it. I'm not sure what the lambda reference would be discussing.
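As a rough sketch of that MTBF calculation on the machine events table from the question: each operational period runs from an event_type=0 (machine available) to the next event_type=1 (machine failed), and MTBF is the sum of those periods divided by the number of failures. The function name mtbf_per_machine and the tuple layout are assumptions made for this illustration. event_type=2 rows are ignored, and an operational period that has not yet ended in a failure is not counted, which is one possible choice rather than the only one.

```python
from collections import defaultdict

# (time, machine_id, event_type) rows copied from the machine events table.
EVENTS = [
    (0, 6640223, 0),
    (30382.66466, 6640223, 1),
    (30399.2805, 6640223, 0),
    (37315.23415, 6640223, 1),
    (37321.64514, 6640223, 0),
    (0, 3585557842, 0),
    (37067.13354, 3585557842, 1),
    (37081.0917, 3585557842, 0),
    (37081.2932, 3585557842, 2),
    (37321.33633, 3585557842, 2),
    (37645.77424, 3585557842, 1),
    (37824.73506, 3585557842, 0),
    (37824.73506, 3585557842, 2),
    (41666.42118, 3585557842, 2),
]


def mtbf_per_machine(events):
    """Return {machine_id: MTBF} = total observed uptime / number of failures."""
    up_since = {}                 # machine_id -> time it last became available
    uptime = defaultdict(float)   # machine_id -> summed operational time
    failures = defaultdict(int)   # machine_id -> number of observed failures

    # Process each machine's events in time order.
    for time, machine, event_type in sorted(events, key=lambda e: (e[1], e[0])):
        if event_type == 0:
            up_since[machine] = time
        elif event_type == 1 and machine in up_since:
            uptime[machine] += time - up_since.pop(machine)
            failures[machine] += 1
        # event_type == 2 (resources changed) does not affect uptime here.

    return {m: uptime[m] / failures[m] for m in failures}


for machine, mtbf in mtbf_per_machine(EVENTS).items():
    print(machine, round(mtbf, 2))
```

This gives one MTBF value per machine over the whole log. To get a value per time period (row), the same function could be applied to only the events seen up to that row's time.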

Why is total time taken by Google Dataflow more than sum of times taken by individual steps

I really cannot understand why the total elapsed time for a Dataflow job is so much higher than the time taken by the individual steps.
For example, the total elapsed time for the Dataflow job in the picture is 2 min 39 sec, while the time spent in the individual steps is just 10 sec. Even allowing for the time spent in the setup and destroy phases, a difference of 149 sec seems too much.
Is there some other way of reading the individual stage timing, or am I missing something else?
Thanks
In my opinion, 2 min 39 sec is fine. Your job reads a file, runs a ParDo, and then writes the result to BigQuery.
There are a lot of factors involved in this time:
How much data you need to process. In your case I don't think you are processing much data.
What computation you are doing. Your ParDo step takes only 3 sec, so apart from the small amount of data, the ParDo does not do much computation either.
Writing to BigQuery. In your case it takes only 5 sec.
The creation and destroy phases of the Dataflow job stay constant; in your case that is 149 sec. The 10 sec your job spends in its steps depends on the three factors explained above.
Now assume that you have to process 2 million records and each record's transform takes 10 sec. In that case the time will be much higher, i.e. 10 sec * 2 million records for a single-node Dataflow load job.
In that case the 149 sec is insignificant next to the whole job completion time, because it is incurred once for the whole job, whereas the processing time is 10 sec * 2 million records.
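To put numbers on that argument, here is a small sketch that computes what share of the total elapsed time is fixed overhead. It takes the answer's assumptions at face value: the creation and destroy phases cost a constant 149 sec, and the hypothetical large job processes its 2 million records one after another on a single node. The function name fixed_overhead_share is made up for this illustration.

```python
def fixed_overhead_share(fixed_sec, work_sec):
    """Fraction of total elapsed time spent in the fixed setup/teardown phases."""
    return fixed_sec / (fixed_sec + work_sec)


# The job from the question: 149 sec overhead, 10 sec of actual step time.
small_job = fixed_overhead_share(149, 10)

# The hypothetical job from the answer: 2 million records at 10 sec each.
large_job = fixed_overhead_share(149, 10 * 2_000_000)

print(f"small job: {small_job:.1%} of elapsed time is overhead")
print(f"large job: {large_job:.4%} of elapsed time is overhead")
```

For the small job the overhead dominates the elapsed time, while for the large job the same 149 sec is a negligible fraction.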
Hope this information helps you understand the timing.

How should I evaluate the insert benchmark from CrateDB?

I am trying to understand and interpret the benchmark provided by CrateDB. (https://staging.crate.io/benchmark/)
I am interested in how many elements can be inserted per second.
I know that this may vary with the size of the tuples, so let me assume that I have the same element sizes as CrateDB uses in their example.
They provide an example for bulk insertion, where it takes on average 50 milliseconds to insert a bulk of 10.000 (integer/string pairs).
Now, can I calculate that it is possible to insert 20 bulks of 10.000 pairs during 1s (1000 milliseconds)?
1000ms/50ms = 20 -> 20*10000 = 200000 -> 200000 integer/string pairs per second
Can I say how the result would differ if I have 7 integers and 2 decimals(7,4)?
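The calculation above can be written as a short sketch. It assumes that bulks are inserted strictly one after another and that every bulk takes the average 50 ms, so it is only a back-of-the-envelope estimate and not a measured throughput. The function name pairs_per_second is made up for this illustration.

```python
def pairs_per_second(bulk_latency_ms, bulk_size):
    """Estimated rows per second when bulks are sent sequentially."""
    bulks_per_second = 1000 / bulk_latency_ms
    return bulks_per_second * bulk_size


# 50 ms per bulk of 10,000 integer/string pairs, as quoted from the benchmark.
print(pairs_per_second(50, 10_000))  # prints: 200000.0
```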
Well this: https://staging.crate.io/benchmark/ is only comparable to itself, so it will show if code changes/features made CrateDB slower/faster. It's not a reliable source for actual benchmarking and won't give you comparable numbers (among other things because the setup is vanilla).
As for your question, I recommend running your own benchmarks to satisfy whatever requirements you have. 😬
The tool we used for these benchmarks is cr8 from one of our core devs!
Cheers, Claus

Tokyo Cabinet - Slower inserts after hitting 1 million

I am evaluating the Tokyo Cabinet table engine. The insert rate slows down considerably after hitting 1 million records. The batch size is 100,000 and each batch is done within a transaction. I tried setting xmsiz, but to no avail. Has anyone faced this problem with Tokyo Cabinet?
Details
Tokyo cabinet - 1.4.3
Perl bindings - 1.23
OS : Ubuntu 7.10 (VMWare Player on top of Windows XP)
I hit a brick wall around 1 million records per shard as well (sharding on the client side, nothing fancy). I tried various ttserver options and they seemed to make no difference, so I looked at the kernel side and found that
echo 80 > /proc/sys/vm/dirty_ratio
(previous value was 10) gave a big improvement - the following is the total size of the data (on 8 shards, each on its own node) printed every minute:
total: 14238792 records, 27.5881 GB size
total: 14263546 records, 27.6415 GB size
total: 14288997 records, 27.6824 GB size
total: 14309739 records, 27.7144 GB size
total: 14323563 records, 27.7438 GB size
(here I changed the dirty_ratio setting for all shards)
total: 14394007 records, 27.8996 GB size
total: 14486489 records, 28.0758 GB size
total: 14571409 records, 28.2898 GB size
total: 14663636 records, 28.4929 GB size
total: 14802109 records, 28.7366 GB size
So you can see that the improvement was in the order of 7-8 times. Database size was around 4.5GB per node at that point (including indexes) and the nodes have 8GB RAM (so dirty_ratio of 10 meant that the kernel tried to keep less than ca. 800MB dirty).
Next thing I'll try is ext2 (currently: ext3) and noatime and also keeping everything on a ramdisk (that would probably waste twice the amount of memory, but might be worth it).
I just set the cache option and it is now significantly faster.
I think modifying the bnum parameter in the dbtune function will also give a significant speed improvement.

Resources