Is the number of tasks the same as the number of fits for GridSearchCV Logistic Regression? - logistic-regression

I am training a Logistic Regression model with GridSearchCV. The log says:
Fitting 3 folds for each of 1600 candidates, totalling 4800 fits
Further, the following line about tasks is printed in the log:
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 2.9min
Is the number of tasks here (like the 42 tasks above) the same as the number of fits (= 4800)?
I want to estimate the time it will take to finish the training.

Deciphering it step by step:
Fitting 3 folds for each of 1600 candidates, totalling 4800 fits
1600 candidates means you are trying out 1600 hyperparameter combinations.
Fitting 3 folds means you specified cv=3, so you are cross-validating with 3 folds on the training data.
totalling 4800 fits = 1600 * 3, i.e. we have 4800 tasks.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 2.9min
Parallel(n_jobs=-1): -1 means you are running on all cores of your CPU.
Done 42 tasks means that out of 4800 fits, 42 have been completed.
elapsed: 2.9min - from the time execution started, it took 2.9 minutes to complete those 42 fits. So yes, a task here is the same as a fit, and you can roughly extrapolate the total time: 4800 fits * (2.9 min / 42 fits) ≈ 331 min.
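A minimal sketch of that extrapolation (the linear assumption is mine; per-candidate fit times can vary a lot, so treat the estimate as rough):

```python
# Estimating total GridSearchCV runtime from the joblib progress log.
# Numbers come from the question's log; the linear extrapolation is an
# assumption and ignores per-candidate differences in fit time.
def estimate_total_minutes(done_tasks, elapsed_min, total_fits):
    """Extrapolate total runtime assuming all fits take roughly equal time."""
    return elapsed_min / done_tasks * total_fits

n_candidates = 1600
cv_folds = 3
total_fits = n_candidates * cv_folds     # 4800, matching "totalling 4800 fits"

est = estimate_total_minutes(done_tasks=42, elapsed_min=2.9, total_fits=total_fits)
print(f"Estimated total runtime: {est:.0f} min (~{est / 60:.1f} h)")  # ~331 min
```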
Let me know if you still have any doubts.

Related

Consumption Rate using a small dataset with variability

I am looking to find the consumption rate, i.e. how fast I am consuming energy, so that I can later predict when my energy will reach a certain threshold. My dataset is fairly small and I am looking to see what the best method is to use:
Data
date Energy_used
4/1/2021 877
5/1/2021 898
6/1/2021 940
7/1/2021 962
8/1/2021 950
9/1/2021 999
10/1/2021 1008
11/1/2021 1136
12/1/2021 1108
1/1/2022 1213
2/1/2022 1222
3/1/2022 1175
4/1/2022 1108
5/1/2022 1144
6/1/2022 1182
7/1/2022 1149
8/1/2022 1124
Doing
We know we can discover how fast we are consuming by using a formula:
(start - end) / # of months
For example, if we look over a 3-month period, we get (877 - 940)/3 = -21 (from April to June 2021).
From this calculation we are burning through 21 units of energy per month. However, how would I compensate for the variability within this dataset? The energy usage fluctuates and is not constant. What is the best methodology to determine how fast I am burning through this energy?
Any suggestion is appreciated
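The two-point formula can be reproduced in code; as a suggestion that is not from the original post, a least-squares slope over all 17 months smooths out the month-to-month variability instead of relying on two endpoints:

```python
# The question's two-point rate, plus a least-squares slope (a suggested
# alternative, not from the original post) that uses every month and is
# therefore less sensitive to single-month fluctuations.
energy = [877, 898, 940, 962, 950, 999, 1008, 1136, 1108,
          1213, 1222, 1175, 1108, 1144, 1182, 1149, 1124]

# Two-point rate over April-June 2021, as computed in the question:
rate_3mo = (energy[0] - energy[2]) / 3          # (877 - 940) / 3 = -21.0

# Least-squares slope over all months (units of energy per month):
n = len(energy)
x_mean = (n - 1) / 2
y_mean = sum(energy) / n
slope = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(energy)) \
        / sum((i - x_mean) ** 2 for i in range(n))
print(rate_3mo, round(slope, 1))                # -21.0, ~19.0 units/month
```

The slope (~19 units/month) is the average trend with the fluctuations averaged out; its residuals give a direct measure of the variability around that trend.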

How do I inject the reliability of machines into my ANN

I would like to predict the reliability of my physical machines with an ANN.
Q1) What is the right metric to measure the reliability of a repairable machine?
Q2) In order to calculate the reliability of each machine in each time period (row), should I calculate TBF or MTBF to feed my ANN?
Q3) Is an ANN a good machine learning approach for this problem?
Let's take a look.
In my predictor ANN, one of the inputs is the current reliability value of my physical machines, obtained by applying the right distribution function with the right metric (MTBF or MTTF). In the sample data, there are two machines with some log events:
time, machine_id, and event_type. event_type = 0 when a machine became available to the cluster, event_type = 1 when a machine failed, and event_type = 2 when a machine available to the cluster had its available resources changed.
For a non-repairable product, MTTF is the preferred reliability metric; MTBF is for repairable products.
What is the right metric to get the current reliability value for each time-period row: is it TBF or MTBF? Previously I used MTTF = total uptime / total number of failures. To calculate the uptime, I subtract each event_type = 1 time from the first preceding event_type = 0 time, and so on, then divide the total uptime by the number of failures. Or do I need TBF for each row? The machine events table looks like:
time machine_id event_type R()
0 6640223 0
30382.66466 6640223 1
30399.2805 6640223 0
37315.23415 6640223 1
37321.64514 6640223 0
0 3585557842 0
37067.13354 3585557842 1
37081.0917 3585557842 0
37081.2932 3585557842 2
37321.33633 3585557842 2
37645.77424 3585557842 1
37824.73506 3585557842 0
37824.73506 3585557842 2
41666.42118 3585557842 2
After preprocessing the previous table of machine events to get input_2 (reliability) for the training data, the expected table should look like:
start_time machine_id input_x1 input_2_(Reliability) Predicted_output_Reliability
0 111 0.06 xx.xx
1 111 0.04 xx.xx
2 111 0.06 xx.xx
3 111 0.55 xx.xx
0 222 0.06 xx.xx
1 222 0.06 xx.xx
2 222 0.86 xx.xx
3 222 0.06 xx.xx
mean time TO failure
It is (or should be) a predictor of equipment reliability. The TO in that term indicates its predictive intent.
Mean time to failure (MTTF) is the length of time a device or other
product is expected to last in operation. MTTF is one of many ways to
evaluate the reliability of pieces of hardware or other technology.
https://www.techopedia.com/definition/8281/mean-time-to-failure-mttf
e.g.
take total hours of operation of same equipment items
divided by the number of failures of those items
If there are 100 items, and all except one operate for 100 hours.
One failure happens at 50 hours.
MTTF = ((99 items x 100 hrs) + (1 item x 50 hrs)) / 1 failure = 9950 hours
----
I believe you have been calculating MTBF
mean time BETWEEN failures
This measure is based on recorded events.
Mean time between failure (MTBF) refers to the average amount of time
that a device or product functions before failing. This unit of
measurement includes only operational time between failures and does
not include repair times, assuming the item is repaired and begins
functioning again. MTBF figures are often used to project how likely a
single unit is to fail within a certain period of time.
https://www.techopedia.com/definition/2718/mean-time-between-failures-mtbf
the MTBF of a component is the sum of the lengths of the operational
periods divided by the number of observed failures
https://en.wikipedia.org/wiki/Mean_time_between_failures
In short, the data you have in that table is suited to an MTBF calculation, in the manner you have been doing it. I'm not sure what the lambda reference would be discussing.
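The MTBF calculation described above can be sketched from the first machine's events (times copied from the table; type-2 resource-change events are simply ignored for uptime):

```python
# Computing MTBF from an event log like the one above.
# Assumes event_type 0 = became available, 1 = failed; the rows are
# machine 6640223's events from the table in the question.
events = [
    (0.0,         0),   # available
    (30382.66466, 1),   # failed
    (30399.2805,  0),   # available again
    (37315.23415, 1),   # failed
    (37321.64514, 0),   # available again
]

def mtbf(events):
    """Sum of operational periods divided by the number of observed failures."""
    uptime, failures, up_since = 0.0, 0, None
    for t, etype in events:
        if etype == 0:
            up_since = t
        elif etype == 1 and up_since is not None:
            uptime += t - up_since   # close the current operational period
            failures += 1
            up_since = None
    return uptime / failures if failures else float("inf")

print(round(mtbf(events), 2))   # 18649.31
```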

Why is total time taken by Google Dataflow more than sum of times taken by individual steps

I am really unable to understand why the total elapsed time for a Dataflow job is so much higher than the time taken by the individual steps.
For example, the total elapsed time for the dataflow job in the picture is 2 min 39 sec, while the time spent in the individual steps is just 10 sec. Even if we consider the time spent in the setup and teardown phases, there is a difference of 149 sec, which is too much.
Is there some other way of reading the individual stage timings, or am I missing something else?
Thanks
In my opinion, 2 min 39 sec is fine for this job. You are reading a file, then applying a ParDo, then writing the result to BigQuery.
There are a lot of factors involved in this timing:
How much data you need to process - in your case, I don't think you are processing much data.
What computation you are doing - your ParDo step takes only 3 sec, so besides the small amount of data, the ParDo does not involve much computation either.
Writing to BigQuery - in your case it takes only 5 sec.
The creation and teardown phases of the Dataflow job take roughly constant time; in your case that is 149 sec. Your job itself takes only 10 sec, and that part depends on the three factors explained above.
Now assume you have to process 2 million records and each record's transform takes 10 sec. In that case the total time will be much higher, i.e. roughly 10 sec * 2 million records for a single-node Dataflow job.
In that scenario the 149 sec of overhead is negligible compared to the whole job completion time, since the record processing alone takes 10 sec * 2 million records.
Hope this information helps you understand the timing.
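The overhead argument above can be put in numbers (the per-record times are the illustrative figures from the answer, not measured values):

```python
# Fixed pipeline overhead matters only when the per-record work is small.
# 149 sec is the setup/teardown figure from the question; the per-record
# times are the answer's illustrative assumptions.
overhead_sec = 149

def total_time(n_records, sec_per_record):
    """Total job time: fixed overhead plus single-node processing time."""
    return overhead_sec + n_records * sec_per_record

small_job = total_time(1, 10)           # 159 sec: overhead dominates
big_job = total_time(2_000_000, 10)     # overhead is a rounding error
print(round(overhead_sec / small_job, 3))   # ~0.937 of the small job
print(overhead_sec / big_job)               # tiny fraction of the big job
```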

A way to effectively remove outliers from a big array in matlab

In the software I am developing, at some point I have a big array of around 250 elements. I am taking the average of those elements to obtain one mean value. The problem is that I have outliers in this big array, at the beginning and at the end. So for instance the array could be:
A = [150 200 250 300 1100 1106 1130 1132 1120 1125 1122 1121 1115 2100 2500 2400 2300]
So in this case I would like to remove 150 200 250 300 2100 2500 2400 2300 from the array.
I know I could set those indexes to zero, but I need a way to automatically remove those outliers no matter how many there are at the start or at the end.
Can anyone suggest a robust way of removing those outliers?
You can do something like:
A(A>(mean(A)-std(A)) & A<(mean(A)+std(A)))
> ans = 1100 1106 1130 1132 1120 1125 1122 1121 1115
Normally a robust estimator works better with outliers (https://en.wikipedia.org/wiki/Robust_statistics). The estimated mean and std change a lot if the outliers are very large. I prefer to use the median and the median absolute deviation (https://en.wikipedia.org/wiki/Median_absolute_deviation):
med = median(A)
mad = median(abs(med-A))
out = (A <med - 3*mad) | (A > med + 3*mad)
A(out) = []
It also depends a lot on what your data represents and how the distribution looks (hist(A)). For example, if your data is skewed to large values you could remove the top 5% of the values or something similar. Sometimes applying a transformation to make the distribution resemble a normal distribution works better; for example, if the distribution is skewed to the right, use a log transform.
I would use a reference approach in this case: pick e.g. 15 elements from the middle of the array, calculate their average/median, and then compare the other values to it using the std or diff(A(end-1:end)). Also, try to use the median instead of the mean.

Average measurement over a few devices

I have a few devices that emit time series data:
[deviceID],[time],[value]
I am using graphite to keep track of this data but the question applies to other databases as well.
I have defined my data retention/precision to be 5 seconds, so each device will have only one value per 5 seconds, which is the average of all observations it made during that period. For example, if these are the real measurements:
device1 1/1/2012 08:00:00 12
device1 1/1/2012 08:00:01 10
device2 1/1/2012 08:00:01 2
device1 1/1/2012 08:00:02 14
Then the data saved will be:
device1 1/1/2012 08:00:00 12
device2 1/1/2012 08:00:00 2
How could I query for the average value across both devices in this time period? I can't just take the average over the saved data (= 7), since it is biased downwards: it does not account for device1 having more measurements. Do I need to keep track of the average for every device pair/trio? Maybe it is best not to do aggregations at all and get maximum flexibility? Or is it acceptable not to allow such cross-device queries if this is just a nice-to-have feature?
Have you considered calculating a weighted mean?
A simple example would be like this:
(No of measurements of d1)*d1 measurement + (No of measurements of d2)*d2 measurement
_____________________________________________________________________________________
Total number of measurements of d1 & d2
This takes into account the number of measurements from each device and so will not be biased downwards.
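A minimal sketch of that weighted mean, using the question's sample window (device1 stored the average 12 from 3 raw measurements, device2 stored 2 from 1). Note it assumes you also store a per-device measurement count alongside each 5-second average:

```python
# Weighted mean across devices, recovering the true average that a plain
# mean of the stored per-device averages would miss.
def weighted_mean(measurements):
    """measurements: list of (count, stored_average) pairs, one per device."""
    total = sum(count for count, _ in measurements)
    return sum(count * avg for count, avg in measurements) / total

# device1: (12 + 10 + 14) / 3 = 12 stored, from 3 measurements
# device2: 2 stored, from 1 measurement
print(weighted_mean([(3, 12.0), (1, 2.0)]))   # 9.5, not the biased 7.0
```

9.5 is exactly the mean of the four raw measurements (12, 10, 14, 2), which is why storing the counts is enough to answer cross-device queries without keeping raw data.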
