Average measurement over a few devices - analytics

I have a few devices that emit time series data:
[deviceID],[time],[value]
I am using Graphite to keep track of this data, but the question applies to other databases as well.
I have defined my data retention/precision to be 5 seconds, so each device stores only one value per 5-second bucket: the average of all observations it made during that period. For example, if these are the raw measurements:
device1 1/1/2012 08:00:00 12
device1 1/1/2012 08:00:01 10
device2 1/1/2012 08:00:01 2
device1 1/1/2012 08:00:02 14
Then the data saved will be:
device1 1/1/2012 08:00:00 12
device2 1/1/2012 08:00:00 2
How could I query for the average value across both devices in this time period? I can't simply average the saved values (= 7), because that is biased downward: it ignores that device1 had more measurements. Do I need to keep track of the average for every device pair/trio? Would it be better to skip aggregation entirely and keep maximum flexibility? Or is it acceptable to disallow such cross-device queries if they are just a nice-to-have feature?

Have you considered calculating a weighted mean?
A simple example would be:

    weighted mean = (n1 * avg_d1 + n2 * avg_d2) / (n1 + n2)

where n1 and n2 are the numbers of raw measurements from device1 and device2, and avg_d1, avg_d2 are their stored averages. This takes the number of measurements per device into account, so it is not biased downwards.
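Assuming the per-device observation counts are available (they would have to be stored alongside the 5-second averages, since the stored average alone does not carry them), the weighted mean for the example above can be sketched as:

```python
# Weighted mean across devices: weight each device's stored average
# by its observation count in the retention window.
readings = [
    (12.0, 3),  # device1: average of 12, 10, 14
    (2.0, 1),   # device2: single observation of 2
]

total_obs = sum(n for _, n in readings)
weighted_mean = sum(avg * n for avg, n in readings) / total_obs
print(weighted_mean)  # 9.5, the same as averaging all 4 raw measurements
```

Note that 9.5 is exactly the mean of the four raw values (12, 10, 2, 14), which is what the naive average of the stored values (7) fails to recover.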

Consumption Rate using a small dataset with variability

I am looking to find the consumption rate, i.e. how fast I am consuming energy, so that I can later predict when my energy use will reach a certain threshold. My dataset is fairly small, and I would like to know which method suits it best:
Data
date Energy_used
4/1/2021 877
5/1/2021 898
6/1/2021 940
7/1/2021 962
8/1/2021 950
9/1/2021 999
10/1/2021 1008
11/1/2021 1136
12/1/2021 1108
1/1/2022 1213
2/1/2022 1222
3/1/2022 1175
4/1/2022 1108
5/1/2022 1144
6/1/2022 1182
7/1/2022 1149
8/1/2022 1124
Doing
We know we can estimate how fast we are going with a formula:
(start - end) / number of months
For example, looking over a 3-month period: (877 - 940)/3 = -21 (from April to June 2021).
From this calculation we are burning through an additional 21 units of energy per month. However, the energy use fluctuates and is not constant, so how would I compensate for the variability within this dataset? What is the best methodology to determine how fast I am burning through this energy?
Any suggestion is appreciated
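One simple way to compensate for the variability is to fit a least-squares line through the whole series and use its slope as the monthly rate, rather than relying on two endpoints. A minimal sketch with the data above:

```python
# Least-squares slope over the whole series smooths out month-to-month
# variability better than a two-point (start - end)/months estimate.
energy = [877, 898, 940, 962, 950, 999, 1008, 1136, 1108,
          1213, 1222, 1175, 1108, 1144, 1182, 1149, 1124]
months = list(range(len(energy)))  # 0 = Apr 2021, 16 = Aug 2022

n = len(energy)
mean_x = sum(months) / n
mean_y = sum(energy) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, energy))
         / sum((x - mean_x) ** 2 for x in months))
print(round(slope, 1))  # ~19.0 units per month for this data
```

The regression slope (about 19 units/month) is more stable than any individual 3-month difference, because every observation contributes to it.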

How can I filter for records whose sum make up X% of the total in Google Data Studio?

I have a data set containing clients and two dimensions: Lifetime and Revenue, e.g.
Client   Lifetime   Revenue
Tesla    7,4        280
Amazon   9,2        450
Disney   2,6        130
Otto     11,8       940
BMW      3,5        170
I am trying to calculate the average lifetime, but instead of calculating the average over all clients, I only want to include the top clients (regarding revenue) whose sum of revenue make up 80% of the total revenue.
When doing this manually the process would be quite clear:
1. Sort the clients by revenue, descending.
2. Calculate the total revenue and thus what 80% of the total is.
3. Check at which row the running sum of the clients' revenue exceeds that 80% and include only the records up to that point.
This is what the resulting table would look like:
Client   Lifetime   Revenue   Running Sum of Rev
Otto     11,8       940       940
Amazon   9,2        450       1390
Tesla    7,4        280       1670
BMW      3,5        170       1840
Disney   2,6        130       1970
Since the total revenue is 1970, 80% of that amounts to 1576. Therefore I would want to select the top 3 (Otto, Amazon, Tesla) from the set, because their running sum (1670) is the first to exceed 1576.
Then I could calculate the average lifetime of those three clients only.
I have no idea whether this is possible in an automated way in Google Data Studio. Filters can only take absolute values, and in data sets/blends the sort order is not taken into account either.
Alternative:
I would already be quite happy with something like including only the top 80% clients from the list of clients sorted by revenue (i.e. the 80% quantile). This would be the top 4 in the example (0.8 * 5 = 4), instead of the top 3 as of the original question, but that seems equally impossible.
Data Set (Google Sheets)
Google Data Studio Report
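For comparison, outside Data Studio (e.g. in a script over the exported sheet) the manual steps above are straightforward to automate. A sketch using the example figures:

```python
# Sort clients by revenue descending, keep rows until the running sum
# reaches 80% of total revenue, then average the lifetimes of those rows.
clients = [("Tesla", 7.4, 280), ("Amazon", 9.2, 450),
           ("Disney", 2.6, 130), ("Otto", 11.8, 940), ("BMW", 3.5, 170)]

threshold = 0.8 * sum(rev for _, _, rev in clients)  # 0.8 * 1970 = 1576
top, running = [], 0
for name, lifetime, rev in sorted(clients, key=lambda c: -c[2]):
    top.append((name, lifetime))
    running += rev
    if running >= threshold:
        break  # running sum first exceeds the 80% mark here

avg_lifetime = sum(lt for _, lt in top) / len(top)
print([n for n, _ in top], round(avg_lifetime, 2))  # top 3, avg ~9.47
```

This reproduces the manual result: Otto, Amazon and Tesla are kept, and their average lifetime is about 9.47.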

Can I carry out a Propensity Score Matching with a general population of 90 observations and a treatment group of 20?

My population consists of 90 administrative zones that divide the city. Of those zones, only 20 received the treatment. After carrying out PSM, I have 17 zones in the treatment group and 17 in the control group, with good balance on the covariates.
I know scholars recommend a minimum of 100 observations in each group, but given that my population of interest only has 90 units, is it okay to proceed with my research? I have already invested a substantial amount of work getting this far.

How do I inject the reliability of machines into my ANN

I would like to predict the reliability of my physical machines with an ANN.
Q1) What is the right metric to measure the reliability of a repairable machine?
Q2) To calculate the reliability of each machine for each time period (row), should I compute TBF or MTBF and feed that to my ANN?
Q3) Is an ANN a good machine learning approach for this problem?
Let's take a look.
One of the inputs to my predictor ANN is the current reliability value of each physical machine, obtained by applying the right distribution function with the right metric (MTBF or MTTF). The sample data contains two machines with some log events, each consisting of a time, a machine ID, and an event_type: event_type = 0 when a machine became available to the cluster, event_type = 1 when a machine failed, and event_type = 2 when an available machine had its available resources changed.
For a non-repairable product, MTTF is the preferred reliability measure; MTBF is for repairable products.
What is the right metric for the current reliability value in each time-period row: TBF or MTBF? Previously I used MTTF = total uptime / total number of failures. To calculate the uptime, I subtract the time of each event_type = 0 from the time of the next event_type = 1, and so on, then divide the total uptime by the number of failures. Or do I need TBF for each row? The machine events table looks like:
time          machine_id   event_type   R()
0             6640223      0
30382.66466   6640223      1
30399.2805    6640223      0
37315.23415   6640223      1
37321.64514   6640223      0
0             3585557842   0
37067.13354   3585557842   1
37081.0917    3585557842   0
37081.2932    3585557842   2
37321.33633   3585557842   2
37645.77424   3585557842   1
37824.73506   3585557842   0
37824.73506   3585557842   2
41666.42118   3585557842   2
After preprocessing the machine events table above to get input_2 (reliability) for the training data, the expected table should look like:
start_time machine_id input_x1 input_2_(Reliability) Predicted_output_Reliability
0 111 0.06 xx.xx
1 111 0.04 xx.xx
2 111 0.06 xx.xx
3 111 0.55 xx.xx
0 222 0.06 xx.xx
1 222 0.06 xx.xx
2 222 0.86 xx.xx
3 222 0.06 xx.xx
Mean time TO failure
It is (or should be) a predictor of equipment reliability. The TO in that term indicates its predictive intent.
Mean time to failure (MTTF) is the length of time a device or other
product is expected to last in operation. MTTF is one of many ways to
evaluate the reliability of pieces of hardware or other technology.
https://www.techopedia.com/definition/8281/mean-time-to-failure-mttf
e.g.
take the total hours of operation of the same equipment items,
divided by the number of failures of those items.
If there are 100 items, all except one operate for 100 hours,
and one failure happens at 50 hours:
MTTF = ((99 items x 100 hrs) + (1 item x 50 hrs)) / 1 failure = 9950 hours
----
I believe you have been calculating MTBF
mean time BETWEEN failures
This measure is based on recorded events.
Mean time between failure (MTBF) refers to the average amount of time
that a device or product functions before failing. This unit of
measurement includes only operational time between failures and does
not include repair times, assuming the item is repaired and begins
functioning again. MTBF figures are often used to project how likely a
single unit is to fail within a certain period of time.
https://www.techopedia.com/definition/2718/mean-time-between-failures-mtbf
the MTBF of a component is the sum of the lengths of the operational
periods divided by the number of observed failures
https://en.wikipedia.org/wiki/Mean_time_between_failures
In short the data you have in that table is suited to MTBF calculation, in the manner you have been doing it. I'm not sure what the lambda reference would be discussing.
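As a sketch, the MTBF calculation described above can be run on the first machine's events from the question (assuming event_type 0 = became available, 1 = failed, and ignoring event_type 2, which only changes resources):

```python
# MTBF from an event log: sum the uptime between each 0-event (available)
# and the following 1-event (failure), then divide by the failure count.
events = [  # (time, machine_id, event_type), from the question's table
    (0.0, 6640223, 0), (30382.66466, 6640223, 1),
    (30399.2805, 6640223, 0), (37315.23415, 6640223, 1),
    (37321.64514, 6640223, 0),
]

uptime, failures, up_since = 0.0, 0, None
for t, _, etype in events:
    if etype == 0:          # machine became available: start an uptime period
        up_since = t
    elif etype == 1 and up_since is not None:  # failure: close the period
        uptime += t - up_since
        failures += 1
        up_since = None
# event_type 2 (resource change) does not affect uptime and is ignored.

mtbf = uptime / failures
print(round(mtbf, 2))  # ~18649.31 for these events
```

A trailing 0-event without a subsequent failure (as here) contributes no uptime, which matches treating MTBF as operational time between observed failures only.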

A way to effectively remove outliers from a big array in matlab

In the software I am developing I have, at some point, a big array of around 250 elements. I take the average of those elements to obtain one mean value. The problem is that this array has outliers at the beginning and at the end. For instance, the array could be:
A = [150 200 250 300 1100 1106 1130 1132 1120 1125 1122 1121 1115 2100 2500 2400 2300]
In this case I would like to remove 150, 200, 250, 300, 2100, 2500, 2400, 2300 from the array.
I know I could set those indexes to zero, but I need a way for the software to remove these outliers automatically, no matter how many there are at the start or at the end.
Can anyone suggest a robust way of removing those outliers?
You can do something like:
A(A>(mean(A)-std(A)) & A<(mean(A)+std(A)))
> ans = 1100 1106 1130 1132 1120 1125 1122 1121 1115
Normally a robust estimator works better with outliers (https://en.wikipedia.org/wiki/Robust_statistics). The estimated mean and std will change a lot if the outliers are very large. I prefer to use the median and the median absolute deviation (https://en.wikipedia.org/wiki/Median_absolute_deviation):
med = median(A);
mad = median(abs(A - med));   % note: this shadows MATLAB's built-in mad()
out = (A < med - 3*mad) | (A > med + 3*mad);
A(out) = [];
It also depends a lot on what your data represents and what the distribution looks like (hist(A)). For example, if your data is skewed towards large values you could remove everything above the 0.95 quantile, or something similar. Sometimes a transformation that makes the distribution resemble a normal distribution works better; for example, if the distribution is skewed to the right, use a log transform.
I use a reference approach in this case: pick, e.g., 15 elements from the middle of the array, calculate their average/median, and then compare the rest against it using the std or diff(A(end-1:end)). Also, try using the median instead of the mean.
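For comparison, a Python sketch of the same median/MAD filter, run on the example array:

```python
# Median/MAD outlier filter: values further than 3 MADs from the
# median are treated as outliers and dropped.
from statistics import median

A = [150, 200, 250, 300, 1100, 1106, 1130, 1132, 1120,
     1125, 1122, 1121, 1115, 2100, 2500, 2400, 2300]

med = median(A)                              # 1121
mad = median(abs(x - med) for x in A)        # 21
kept = [x for x in A if abs(x - med) <= 3 * mad]
print(kept)  # only the 11xx block survives; both tails are dropped
```

On this array the filter keeps exactly the nine central values and removes the low and high tails, matching the result shown for the mean/std version above.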